FIDIS
Future of Identity in the Information Society (No. 507512)
D2.3
[Final], Version: 2.0
File: fidis-wp2-del2.3.models.doc
Page 18
2.3.2 The extraction from data sources and from processes
In this case, the values associated with the attributes originate from two different sources: (1)
databases; and (2) processes.
In the first case, the databases may be governmental (such as police or tax databases), human
resource databases (enterprise resource planning and knowledge management systems holding, for
example, payroll or training information) or health record databases (managed by hospitals or by
social security units).
In the second case, the data originate from processes that capture them (and that subsequently
store them in databases). Examples of such processes include e-commerce systems (such as Amazon)
and loyalty programmes, which capture the history of the transactions associated with each
customer, and virtual community systems, which capture the history of the activities of their
members (such as length of membership in the community and number of postings).
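As an illustration of how a process can yield attribute values, the following minimal sketch derives per-user attributes from a recorded activity log. The event format, the field names and the function extract_attributes are assumptions made for this example and are not taken from any particular system.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity log, as an e-commerce or virtual community
# system might record it (format assumed for illustration).
events = [
    {"user": "alice", "kind": "purchase", "day": date(2005, 1, 10)},
    {"user": "alice", "kind": "posting", "day": date(2005, 2, 1)},
    {"user": "alice", "kind": "posting", "day": date(2005, 3, 5)},
    {"user": "bob", "kind": "posting", "day": date(2005, 3, 1)},
]


def extract_attributes(events, today):
    """Derive per-user attribute values from recorded activity."""
    raw = defaultdict(lambda: {"first_seen": None, "postings": 0, "purchases": 0})
    for e in events:
        p = raw[e["user"]]
        # Track the earliest recorded activity of each user.
        if p["first_seen"] is None or e["day"] < p["first_seen"]:
            p["first_seen"] = e["day"]
        if e["kind"] == "posting":
            p["postings"] += 1
        elif e["kind"] == "purchase":
            p["purchases"] += 1
    return {
        user: {
            "days_in_community": (today - p["first_seen"]).days,
            "postings": p["postings"],
            "purchases": p["purchases"],
        }
        for user, p in raw.items()
    }


profiles = extract_attributes(events, today=date(2005, 4, 1))
```

The user takes no part in producing these values: they follow directly from the recorded activity, which is why such data are both reliable and largely outside the user's control.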
The type 1 IMS (organisational function), presented previously, represents a typical category
of systems that employs this method, although it can also be used in the type 3 IMS
(individual function).
The personal data that is present in databases or captured via a set of processes is mostly
outside the user’s control (the possibilities of correction by the end user are often limited).
These data are also often heavily regulated by legislation specifying the type of data that
can be represented and the permitted uses of the data, including the combination of databases.
Even if this mode of collection of personal data appears to be more intrusive to people’s
privacy, it is not without some advantages, even for the people themselves. First, the data
captured via this means can be considered much more reliable, since it directly reflects the
activities of people, and not only the perception of these activities. Second, because this data
collection is automatic, it can be considered less demanding for the end-users.
The attributes whose values can be recorded in this way include characteristics that
have a certain level of permanence, while other categories of personal information can
include all the transactions (commercial or not) in which people have been engaged.
2.3.3 Data calculated and inferred from other attributes
In this case, unknown values associated with particular attributes are calculated from other
attributes (typically those that have been extracted by the previous two methods). This
category is relatively similar to the category previously described; however, it differs in
the level of sophistication of the systems that make use of it. Notably, this method is
more frequently used in Type 3 IMS (individual function) applications, which use it to provide
some level of adaptability (for instance in e-learning systems or e-commerce systems).
The reliability of these calculated attributes is generally lower than that of non-calculated
attributes. For instance, in Amazon the assertion “a customer who has bought a book about
children is interested in children and is likely to buy other books about children” is only
correct on average, since the customer may have bought this book only once, as a present
for somebody else.
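The following minimal sketch illustrates why such an inference holds only on average. It infers an interest in a category only when the purchase history gives repeated evidence for it. The function infer_interests and its thresholds are hypothetical and chosen for illustration; they do not describe how Amazon or any other system actually works.

```python
from collections import Counter


def infer_interests(purchased_categories, min_share=0.5, min_count=2):
    """Infer interests from purchase categories.

    A category counts as an interest only if it was bought at least
    `min_count` times and accounts for at least `min_share` of all
    purchases. Both thresholds are illustrative assumptions.
    """
    counts = Counter(purchased_categories)
    total = len(purchased_categories)
    inferred = {}
    for category, n in counts.items():
        share = n / total
        if n >= min_count and share >= min_share:
            inferred[category] = share
    return inferred


# A customer who bought a single book about children as a present.
gift_buyer = ["cooking", "cooking", "children", "cooking"]
# A customer who repeatedly buys books about children.
parent = ["children", "children", "children", "travel"]

gift_buyer_interests = infer_interests(gift_buyer)
parent_interests = infer_interests(parent)
```

Under these assumed thresholds, the single purchase made as a present does not produce an inferred interest in children, whereas the repeated purchases do. A rule without such thresholds would classify both customers in the same way.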
The level of control over these calculated attributes is often limited by the simplicity of the
algorithm used, and by the way it was configured for the calculation. Thus, people who read the
value of these attributes usually have, at best, only a vague idea of the underlying
principles that have been used. For instance, a calculated attribute could be a level of risk that
a bank calculates for a particular client, resulting from a combination of the values of
attributes such as the gross salary of the person, assets such as real estate that the person
may own, his family status, the postal code of his place of residence, or even his ethnic origin.
Another example is found on certain e-commerce websites, where the preferences of a customer are
determined automatically.
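A calculated attribute of this kind can be sketched as follows. The Client fields, the weights and the thresholds are all hypothetical and serve only to show how several extracted attributes are combined into a single derived value; the number of dependants stands in for family status, and no real credit scoring method is implied.

```python
from dataclasses import dataclass


@dataclass
class Client:
    gross_salary: float   # yearly gross salary (extracted attribute)
    assets_value: float   # value of assets such as real estate
    dependants: int       # assumed stand-in for family status


def risk_level(client, requested_loan):
    """Return 'low', 'medium' or 'high'.

    The weights (3, 0.5, 5000) and the thresholds (0.5, 1.0) are
    illustrative assumptions, not a real scoring model.
    """
    capacity = 3 * client.gross_salary + 0.5 * client.assets_value
    capacity -= 5000 * client.dependants
    if capacity <= 0:
        return "high"
    ratio = requested_loan / capacity
    if ratio < 0.5:
        return "low"
    if ratio < 1.0:
        return "medium"
    return "high"


client = Client(gross_salary=40000, assets_value=100000, dependants=2)
levels = [risk_level(client, loan) for loan in (60000, 120000, 200000)]
```

A person who only sees the resulting value ("low", "medium" or "high") has no means of knowing which attributes contributed to it or how they were weighted, which is the limited control described above.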
The extraction of values via data mining techniques could appear similar to the calculation
methods described above. They differ, however, in that the algorithms are applied globally to
the data of (very large) groups of people, and not to the data set that is associated with a
single person. The algorithms used are also of a more statistical and probabilistic nature,
and often rely on the use of heuristics. Finally, these algorithms may also be used to support
the creation of the user model itself, and in particular to help determine the set of
attributes required to “summarise” the problem (for instance, in a banking application, an
algorithm may determine that knowledge of the age and of the postal code
represents sufficient information to discriminate a reliable customer from an unreliable one,
with a limited risk of error).
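The idea of determining a sufficient set of attributes from group data can be sketched as follows. This is a deliberately simple exhaustive search over a toy data set, assumed for illustration; real data mining tools use more elaborate statistical methods and evaluate the error on data not used for the selection. The record fields and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def error_rate(records, attributes, label="reliable"):
    """Share of records misclassified when predicting, for each group of
    records with identical values of `attributes`, the majority label."""
    groups = defaultdict(Counter)
    for r in records:
        key = tuple(r[a] for a in attributes)
        groups[key][r[label]] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return errors / len(records)


def smallest_sufficient_set(records, candidates, max_error):
    """Return the first smallest attribute subset whose error rate does
    not exceed `max_error`, or None if no subset qualifies."""
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if error_rate(records, subset) <= max_error:
                return subset
    return None


def rec(age, postal, salary, reliable):
    return {"age_band": age, "postal_code": postal,
            "salary_band": salary, "reliable": reliable}


# Toy group data (fabricated for the example, not real customer data).
records = [
    rec("young", "A", "low", False), rec("young", "B", "high", True),
    rec("old", "A", "low", True),    rec("old", "B", "high", True),
    rec("young", "A", "high", False), rec("young", "B", "low", True),
    rec("old", "A", "high", True),   rec("old", "B", "low", True),
]

candidates = ["age_band", "postal_code", "salary_band"]
chosen = smallest_sufficient_set(records, candidates, max_error=0.1)
```

On this toy data no single attribute keeps the error below the assumed limit, whereas the combination of age band and postal code does. The selection depends on the whole group and not on any one person's data.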
Type 2 IMS (profiling function), presented previously, represents a typical category of systems
that employs this method.
The types of attributes that are extracted via mining typically include people-related categories
such as social categories or lifestyles. These attributes can be considered to be more abstract
and less directly associated with the individuals.
At a more micro-level, these attributes can represent some user characteristics and behaviours
that can be automatically extracted from the use of some information systems. For instance,
such attributes, in the context of an e-commerce system, can reflect reliability characteristics
(likelihood of fraud), and, in the context of a virtual community, can reflect the level of
participation (such as the activity of the people in SourceForge.net).