Glossary
and Index
237
GLOSSARY AND INDEX
This glossary contains key words, bolded throughout the text, and their definitions as they are used
for the purposes of this book. The page number listed is not the only page where the term is
found, but it is the page where the term is introduced or where it is primarily defined and
discussed.
Antecedent: In an association rules data mining model, the antecedent is the attribute which
precedes the consequent in an identified rule. Attribute order makes a difference when calculating
the confidence percentage, so identifying which attribute comes first is necessary even if the
reciprocal of the association is also a rule. (Page 85)
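The effect of attribute order on confidence can be illustrated with a minimal sketch. It assumes the common definition of confidence: the share of observations containing the antecedent that also contain the consequent. The observations and the `confidence` helper below are hypothetical and are not taken from the book.

```python
# Hypothetical observations: each set lists the attributes present in one observation.
observations = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"bread", "milk"},
    {"milk"},
]


def confidence(observations, antecedent, consequent):
    """Share of observations containing the antecedent that also contain the consequent."""
    with_antecedent = [obs for obs in observations if antecedent in obs]
    if not with_antecedent:
        return 0.0  # the antecedent never occurs, so the rule has no support
    with_both = [obs for obs in with_antecedent if consequent in obs]
    return len(with_both) / len(with_antecedent)


# Same pair of attributes, opposite order, different confidence.
bread_then_butter = confidence(observations, "bread", "butter")  # 2 of 4 bread observations
butter_then_bread = confidence(observations, "butter", "bread")  # 2 of 2 butter observations
print(bread_then_butter, butter_then_bread)
```

In this made-up data, butter is always found with bread, but bread is found with butter only half the time, so the two directions of the association yield different confidence values.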
Archived Data: Data which have been copied out of a live production database and into a data
warehouse or other permanent system where they can be accessed and analyzed, but not by
primary operational business systems. (Page 18)
Association Rules: A data mining methodology which compares attributes in a data set across all
observations to identify areas where two or more attributes are frequently found together. If their
frequency of coexistence is high enough throughout the data set, the association of those attributes
can be said to be a rule. (Page 74)
Attribute: In columnar data, an attribute is one column. It is named in the data so that it can be
referred to by a model and used in data mining. The term attribute is sometimes interchanged with
the terms ‘field’, ‘variable’, or ‘column’. (Page 16)
Average: The arithmetic mean, calculated by summing all values and dividing by the count of the
values. (Pages 47, 77)
Data Mining
for the Masses
238
Binomial: A data type for any set of values that is limited to one of two numeric options.
(Page 80)
Binominal: In RapidMiner, the data type binominal is used instead of binomial, enabling both
numerical and character-based sets of values that are limited to one of two options. (Page 80)
Business Understanding: See Organizational Understanding. (Page 6)
Case: See Observation. (Page 16)
Case Sensitive: A situation where a computer program recognizes the uppercase version of a
letter or word as being different from the lowercase version of the same letter or word. (Page 199)
Classification: One of the two main goals of conducting data mining activities, with the other
being prediction. Classification creates groupings in a data set based on the similarity of the
observations’ attributes. Some data mining methodologies, such as decision trees, can predict an
observation’s classification. (Page 9)
Code: Code is the product of a programmer’s work. It is a set of instructions, typed in a
specific grammar and syntax, that a computer can understand and execute. According to Lawrence
Lessig, it is one of four methods humans can use to set and control boundaries for behavior when
interacting with computer systems. (Page 233)
Coefficient: In data mining, a coefficient is a value that is calculated based on the values in a data
set that can be used as a multiplier or as an indicator of the relative strength of some attribute or
component in a data mining model. (Page 63)
Column: See Attribute. (Page 16)
Comma Separated Values (CSV): A common text-based format for data sets where the
divisions between attributes (columns of data) are indicated by commas. If commas occur
naturally in some of the values in the data set and those values are not enclosed in quotation
marks, software reading the CSV file will misinterpret them as attribute separators, leading to
misalignment of attributes. (Page 35)
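The misalignment problem can be sketched with Python's standard `csv` module. The sample data below is hypothetical. The sketch compares a naive split on every comma with a CSV-aware reader that honors quotation marks around a value containing a comma.

```python
import csv
import io

# Hypothetical two-attribute data set; the name value contains a comma
# and is therefore wrapped in quotation marks.
raw = 'name,city\n"Smith, John",Boston\n'

# Naive approach: split every line on every comma, ignoring quotes.
naive_rows = [line.split(",") for line in raw.splitlines()]

# CSV-aware approach: the standard library reader respects quoted values.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows[1]))   # the naive split yields one attribute too many
print(parsed_rows[1])       # the reader keeps the quoted value in one attribute
```

The naive split breaks the quoted name into two pieces, so the observation appears to have three attributes instead of two, while the reader keeps the columns aligned.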
Conclusion: See Consequent. (Page 85)
Confidence (Alpha) Level: A value, usually 5% or 0.05, used to test for statistical significance in
some data mining methods. If statistical significance is found, a data miner can say that there is a
95% likelihood that a calculated or predicted value is not a false positive. (Page 132)
Confidence Percent: In predictive data mining, this is the percentage of confidence that the
model has calculated for each of one or more possible predicted values. It is a measure of the
likelihood of false positives in predictions. Regardless of the number of possible predicted values,
their collective confidence percentages will always total 100%. (Page 84)
Consequent: In an association rules data mining model, the consequent is the attribute which
results from the antecedent in an identified rule. If an association rule were characterized as “If
this, then that”, the consequent would be that; in other words, the outcome. (Page 85)
Correlation: A statistical measure of the strength of affinity, based on the similarity of
observational values, of the attributes in a data set. Correlations can be positive (as one
attribute’s values go up or down, so too do the correlated attribute’s values) or negative
(correlated attributes’ values move in opposite directions). Correlations are indicated by
coefficients which fall on a scale between -1 (complete negative correlation) and 1 (complete
positive correlation), with 0 indicating no correlation at all between two attributes. (Page 59)
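The -1 to 1 scale can be illustrated with a short sketch. It assumes the Pearson correlation coefficient, which the glossary entry does not name explicitly. The `pearson` helper and the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when an attribute is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical attribute values.
xs = [1, 2, 3, 4]
rises_together = pearson(xs, [2, 4, 6, 8])     # values move in the same direction
moves_opposite = pearson(xs, [8, 6, 4, 2])     # values move in opposite directions
no_linear_link = pearson(xs, [1, -1, -1, 1])   # no linear relationship
print(rises_together, moves_opposite, no_linear_link)
```

With these values the three coefficients land at the two ends and the middle of the scale: positive, negative, and no correlation.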
CRISP-DM: An acronym for Cross-Industry Standard Process for Data Mining. This process
was jointly developed by several major multi-national corporations around the turn of the new
millennium in order to standardize the approach to mining data. It comprises six cyclical
steps: Business (Organizational) Understanding, Data Understanding, Data Preparation, Modeling,
Evaluation, and Deployment. (Page 5)
Cross-validation: A method of statistically evaluating a training data set for its likelihood of
producing false positives in a predictive data mining model. (Page 221)
Data: Data are any arrangement and compilation of facts. Data may be structured (e.g. arranged
in columns (attributes) and rows (observations)), or unstructured (e.g. paragraphs of text,
computer log files). (Page 3)