Data Mining for the Masses

GLOSSARY AND INDEX

This glossary contains key words, bolded throughout the text, and their definitions as they are used
for the purposes of this book.  The page number listed is not the only page where the term is
found, but it is the page where the term is introduced or where it is primarily defined and
discussed.

Antecedent:  In an association rules data mining model, the antecedent is the attribute which
precedes the consequent in an identified rule.  Attribute order makes a difference when calculating
the confidence percentage, so identifying which attribute comes first is necessary even if the
reciprocal of the association is also a rule. (Page 85)
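
To illustrate why attribute order matters, here is a minimal sketch in Python. It assumes the common definition of confidence (the share of observations containing the antecedent that also contain the consequent); the transaction data and the function name `confidence` are hypothetical and are not taken from the book.

```python
# Hypothetical market-basket data: each set is one observation.
transactions = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"bread", "milk"},
    {"butter"},
]


def confidence(antecedent, consequent, observations):
    """Share of observations with the antecedent that also hold the consequent."""
    with_antecedent = [obs for obs in observations if antecedent in obs]
    if not with_antecedent:
        raise ValueError("antecedent never occurs in the data")
    with_both = [obs for obs in with_antecedent if consequent in obs]
    return len(with_both) / len(with_antecedent)


# The same pair of attributes gives different confidence in each direction.
bread_then_butter = confidence("bread", "butter", transactions)
butter_then_bread = confidence("butter", "bread", transactions)
print(bread_then_butter, butter_then_bread)
```

In this made-up data, bread appears in four observations and butter in three, while both appear together in two, so the two directions of the rule yield different confidence values.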

Archived Data:  Data which have been copied out of a live production database and into a data
warehouse or other permanent system where they can be accessed and analyzed, but not by
primary operational business systems. (Page 18)

Association Rules:  A data mining methodology which compares attributes in a data set across all
observations to identify areas where two or more attributes are frequently found together.  If their
frequency of coexistence is high enough throughout the data set, the association of those attributes
can be said to be a rule. (Page 74)

Attribute:  In columnar data, an attribute is one column.  It is named in the data so that it can be
referred to by a model and used in data mining. The term attribute is sometimes interchanged with
the terms ‘field’, ‘variable’, or ‘column’. (Page 16)

Average:  The arithmetic mean, calculated by summing all values and dividing by the count of the
values. (Pages 47, 77)
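
As a small worked example, the calculation can be sketched in Python. The list `values` is hypothetical; `statistics.mean` from the standard library is shown only as a cross-check.

```python
import statistics

values = [3, 5, 7, 9]  # hypothetical values for one attribute

# Sum all values, then divide by the count of the values.
average = sum(values) / len(values)

# The standard library computes the same arithmetic mean.
library_average = statistics.mean(values)
print(average, library_average)
```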

Binomial:  A data type for any set of values that is limited to one of two numeric options.  (Page
80)

Binominal:  In RapidMiner, the data type binominal is used instead of binomial, enabling both
numerical and character-based sets of values that are limited to one of two options. (Page 80)

Business Understanding:  See Organizational Understanding. (Page 6)

Case:  See Observation. (Page 16)

Case Sensitive:  A situation where a computer program recognizes the uppercase version of a
letter or word as being different from the lowercase version of the same letter or word. (Page 199)

Classification:  One of the two main goals of conducting data mining activities, with the other
being prediction.  Classification creates groupings in a data set based on the similarity of the
observations’ attributes.  Some data mining methodologies, such as decision trees, can predict an
observation’s classification. (Page 9)

Code:  Code is the product of a computer programmer’s work.  It is a set of instructions, typed in a
specific grammar and syntax, that a computer can understand and execute. According to Lawrence
Lessig, it is one of four methods humans can use to set and control boundaries for behavior when
interacting with computer systems. (Page 233)

Coefficient:  In data mining, a coefficient is a value that is calculated based on the values in a data
set that can be used as a multiplier or as an indicator of the relative strength of some attribute or
component in a data mining model. (Page 63)

Column:  See Attribute. (Page 16)

Comma Separated Values (CSV):  A common text-based format for data sets where the
divisions between attributes (columns of data) are indicated by commas.  If commas occur
naturally in some of the values in the data set and those values are not enclosed in quotation
marks, software reading the CSV file will treat them as attribute separators, leading to
misalignment of attributes. (Page 35)
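
The misalignment problem can be sketched in Python. The sample text `raw` is hypothetical, and the double-quote convention used here is a widely followed practice that the book's definition does not itself describe. Splitting on every comma breaks a value apart, whereas Python's standard `csv` module keeps a quoted value whole.

```python
import csv
import io

# Hypothetical two-attribute data set; the name value contains a comma.
raw = 'name,city\n"Smith, John",Boston\n'

# Naive approach: every comma is treated as an attribute separator.
naive = raw.splitlines()[1].split(",")

# The csv module honors the quotation marks around the first value.
parsed = list(csv.reader(io.StringIO(raw)))[1]

print(len(naive), naive)    # three fragments instead of two attributes
print(len(parsed), parsed)  # two attributes, as intended
```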

Conclusion:  See Consequent. (Page 85)

Confidence (Alpha) Level:  A value, usually 5% or 0.05, used to test for statistical significance in
some data mining methods.  If statistical significance is found, a data miner can say that there is a
95% likelihood that a calculated or predicted value is not a false positive. (Page 132)

Confidence Percent:  In predictive data mining, this is the percentage of confidence that
the model has calculated for one or more possible predicted values.  It is a measure of the
likelihood of false positives in predictions.  Regardless of the number of possible predicted values,
their collective confidence percentages will always total 100%. (Page 84)

Consequent:  In an association rules data mining model, the consequent is the attribute which
results from the antecedent in an identified rule.  If an association rule were characterized as “If
this, then that”, the consequent would be that—in other words, the outcome. (Page 85)

Correlation:  A statistical measure of the strength of affinity, based on the similarity of
observational values, of the attributes in a data set.  These can be positive (as one attribute’s values
go up or down, so too do the correlated attribute’s values) or negative (correlated attributes’
values move in opposite directions).  Correlations are indicated by coefficients which fall on a scale
between -1 (complete negative correlation) and 1 (complete positive correlation), with 0 indicating
no correlation at all between two attributes. (Page 59)
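
The scale from -1 to 1 can be illustrated with a short Python sketch. It assumes Pearson's correlation coefficient, which the definition above does not name explicitly; the function `pearson` and the sample attributes are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


hours = [1, 2, 3, 4, 5]       # hypothetical attribute
rising = [2, 4, 6, 8, 10]     # moves in the same direction as hours
falling = [10, 8, 6, 4, 2]    # moves in the opposite direction

positive = pearson(hours, rising)
negative = pearson(hours, falling)
print(positive, negative)
```

Because each made-up attribute is an exact linear function of `hours`, the two coefficients sit at the ends of the scale; real data would usually fall somewhere in between.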

CRISP-DM:  An acronym for Cross-Industry Standard Process for Data Mining.  This process
was jointly developed by several major multi-national corporations around the turn of the new
millennium in order to standardize the approach to mining data.  It comprises six cyclical
steps:  Business (Organizational) Understanding, Data Understanding, Data Preparation, Modeling,
Evaluation, Deployment. (Page 5)

Cross-validation:  A method of statistically evaluating a training data set for its likelihood of
producing false positives in a predictive data mining model. (Page 221)

Data:  Data are any arrangement and compilation of facts.  Data may be structured (e.g. arranged
in columns (attributes) and rows (observations)), or unstructured (e.g. paragraphs of text, computer
log files). (Page 3)
