Introduction to Histograms

Presented By:

Laukik Chitnis

(lchitnis@cise.ufl.edu)

This presentation is mostly based on the following work done by Yannis Ioannidis and Viswanath Poosala {yannis,viswanath}@cs.wisc.edu

Motivation and Expected Features

Why Histograms?

Expected features

Histograms as approximations of data distribution

A data distribution is a set of (attribute value, frequency) pairs

This data distribution has all the information required to answer queries (count, join, aggregate, ...)

But, it is too bulky!

So, we “approximate” it with a histogram!
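To make the (attribute value, frequency) representation concrete, here is a minimal sketch that derives it from a column of raw values. The function name data_distribution and the sample column are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def data_distribution(column):
    """Return the data distribution as (value, frequency) pairs sorted by value."""
    counts = Counter(column)          # frequency of each distinct value
    return sorted(counts.items())     # sort by attribute value

# Hypothetical sample column (assumption for illustration only).
ages = [21, 22, 22, 25, 25, 25, 30]
print(data_distribution(ages))
# → [(21, 1), (22, 2), (25, 3), (30, 1)]
```

The full distribution needs one entry per distinct value, which is what makes it too bulky to keep for large relations.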

equi-width histograms

Spread S and Area A
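The slide names spread and area without defining them. The sketch below assumes the definitions commonly used in the cited line of work: the spread of a value is its distance to the next larger value, and the area is frequency times spread. Taking the spread of the last value as 1 is a convention assumed here, and the function name is hypothetical.

```python
def spreads_and_areas(distribution, last_spread=1):
    """Compute spread and area for each (value, frequency) pair.

    Assumes `distribution` is sorted by value. The spread of the last
    value has no successor, so it is set to `last_spread` (an assumption).
    """
    if not distribution:
        return [], []
    values = [v for v, _ in distribution]
    # spread of value i = next value - value i
    spreads = [nxt - cur for cur, nxt in zip(values, values[1:])] + [last_spread]
    # area of value i = frequency i * spread i
    areas = [f * s for (_, f), s in zip(distribution, spreads)]
    return spreads, areas

spreads, areas = spreads_and_areas([(1, 4), (3, 2), (7, 5)])
print(spreads)  # → [2, 4, 1]
print(areas)    # → [8, 8, 5]
```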

Divide the value axis into buckets of equal ‘width’ (range equalized)

Advantages:

Example: Count of tuples having x<5

Another example: What if the query range boundary does not match the bucket boundary?

Scale the last bucket!

Assumption: uniform distribution within a bucket!

Disadvantages: High variance!
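The two examples above (a range that ends on a bucket boundary, and one that does not) can be sketched as follows. This is a minimal illustration assuming numeric values inside a known domain [lo, hi). The function names and the sample data are hypothetical.

```python
def build_equi_width(distribution, lo, hi, num_buckets):
    """Split [lo, hi) into equal-width buckets; store each bucket's tuple count."""
    width = (hi - lo) / num_buckets
    totals = [0] * num_buckets
    for value, freq in distribution:
        # index of the bucket whose range contains the value
        idx = min(int((value - lo) / width), num_buckets - 1)
        totals[idx] += freq
    return [(lo + i * width, lo + (i + 1) * width, totals[i])
            for i in range(num_buckets)]

def estimate_less_than(buckets, x):
    """Estimate the number of tuples with value < x."""
    estimate = 0.0
    for b_lo, b_hi, total in buckets:
        if x >= b_hi:
            estimate += total                      # bucket fully covered
        elif x > b_lo:
            # partially covered bucket: scale its count, assuming the
            # tuples are uniformly distributed within the bucket
            estimate += total * (x - b_lo) / (b_hi - b_lo)
    return estimate

# Hypothetical data: (value, frequency) pairs over the domain [0, 10).
dist = [(0, 2), (1, 3), (2, 1), (4, 4), (5, 6), (7, 2), (9, 2)]
hist = build_equi_width(dist, lo=0, hi=10, num_buckets=2)
print(estimate_less_than(hist, 5))  # → 10.0 (boundary matches; exact)
print(estimate_less_than(hist, 7))  # → 14.0 (scaled; the true count is 16)
```

The second estimate misses the true count because the frequencies inside the last bucket are far from uniform, which is the high-variance problem noted above.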

Don’t equalize ranges of values but number of tuples in bucket

equi-depth histograms

Disadvantages:

Work well for range queries only when the data distribution has low skew
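A minimal sketch of equi-depth construction, assuming a greedy pass over the values in sorted order that closes a bucket once the running tuple count reaches the next multiple of the per-bucket target. The function name and sample data are hypothetical, and a single value is never split across buckets, so the depths are only approximately equal when the data is skewed.

```python
def build_equi_depth(distribution, num_buckets):
    """Return buckets as (low value, high value, tuple count)."""
    total = sum(freq for _, freq in distribution)
    target = total / num_buckets           # ideal number of tuples per bucket
    buckets = []
    members, count, seen = [], 0, 0
    for value, freq in sorted(distribution):
        members.append(value)
        count += freq                       # tuples in the current bucket
        seen += freq                        # tuples seen so far overall
        reached_quantile = seen >= target * (len(buckets) + 1)
        if reached_quantile and len(buckets) < num_buckets - 1:
            buckets.append((members[0], members[-1], count))
            members, count = [], 0
    if members:                             # the last bucket takes the rest
        buckets.append((members[0], members[-1], count))
    return buckets

dist = [(1, 2), (2, 2), (3, 4), (4, 1), (5, 1), (6, 2)]
print(build_equi_depth(dist, 3))
# → [(1, 2, 4), (3, 3, 4), (4, 6, 4)]
```

Note how the bucket ranges differ in width (2 values, 1 value, 3 values) while each holds 4 tuples.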

Minimize variance – group the items having similar frequencies in a bucket

The frequencies of the attribute values associated with each bucket are either all greater than or all less than the frequencies of the attribute values associated with any other bucket

Advantage: Optimal for reducing errors in estimation

Disadvantage: Storage requirement high

Can we reduce the storage requirement?

Some of the highest frequencies and some of the lowest frequencies are explicitly and accurately maintained in separate individual buckets

Remaining (middle) frequencies are all approximated together in a single bucket

Storage requirement: very little and no index required
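The scheme described above (listed as "end-biased" in the taxonomy later in the deck) can be sketched as follows. This is an illustration under assumptions: the function names, the choice of how many high and low frequencies to keep, and the use of the average as the middle bucket's approximation are not taken from the slides.

```python
def build_end_biased(distribution, num_high, num_low):
    """Keep the extreme frequencies exactly; average the middle ones.

    Assumes num_high + num_low <= number of distinct values.
    Returns (exact, middle_avg): a dict of exactly stored frequencies
    and the average frequency of all remaining values.
    """
    by_freq = sorted(distribution, key=lambda pair: pair[1])
    n = len(by_freq)
    low = by_freq[:num_low]                  # lowest frequencies
    high = by_freq[n - num_high:]            # highest frequencies
    middle = by_freq[num_low:n - num_high]   # everything in between
    exact = dict(low + high)
    middle_avg = sum(f for _, f in middle) / len(middle) if middle else 0.0
    return exact, middle_avg

def estimate_frequency(histogram, value):
    """Exact answer for stored values; the middle average otherwise."""
    exact, middle_avg = histogram
    return exact.get(value, middle_avg)

dist = [("a", 50), ("b", 3), ("c", 4), ("d", 5), ("e", 1)]
hist = build_end_biased(dist, num_high=1, num_low=1)
print(estimate_frequency(hist, "a"))  # → 50
print(estimate_frequency(hist, "c"))  # → 4.0
```

Only the explicitly kept values need to be stored; any value not found among them is assumed to fall in the middle bucket, which is why no index over the middle values is needed.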

Definition

Properties that characterize histograms and determine their effectiveness in query result size estimation.

These properties are mutually orthogonal and form the basis for a general taxonomy of histograms.

Partition Class

Partition Constraint

Sort parameter and Source parameter

Approximation of values within a bucket

Approximation of frequency of a value within a bucket

Trivial Histogram

Equi-sum(V,S) alias Equi-width

Equi-sum(V,F) alias Equi-depth

V-optimal(F,F)

V-optimal-end-biased(F,F)

Spline-based(V,C)

Trivial Histogram

Equi-sum(V,S) alias Equi-width

Equi-sum(V,F) alias Equi-depth

V-optimal(F,F) histograms
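A minimal sketch of the variance-minimizing idea, assuming (F,F) means that values are both sorted and grouped by frequency. It uses a straightforward dynamic program over the sorted frequencies to find the contiguous partition with the smallest total squared deviation from each bucket's mean. The function name is hypothetical, and this is an illustration rather than the construction algorithm from the cited work.

```python
def v_optimal_ff(frequencies, num_buckets):
    """Partition sorted frequencies into buckets minimizing the total
    squared error around each bucket's mean.

    Assumes 1 <= num_buckets <= len(frequencies).
    Returns (buckets, total_error).
    """
    freqs = sorted(frequencies)
    n = len(freqs)
    # prefix sums of f and f^2 give each segment's error in O(1)
    ps, ps2 = [0.0], [0.0]
    for f in freqs:
        ps.append(ps[-1] + f)
        ps2.append(ps2[-1] + f * f)

    def sse(i, j):
        """Squared error of freqs[i:j] around its mean."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimal error splitting freqs[:j] into b buckets
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):        # last bucket is freqs[i:j]
                cost = best[b - 1][i] + sse(i, j)
                if cost < best[b][j]:
                    best[b][j] = cost
                    cut[b][j] = i

    # walk the stored cut points backwards to recover the buckets
    buckets, j = [], n
    for b in range(num_buckets, 0, -1):
        i = cut[b][j]
        buckets.append(freqs[i:j])
        j = i
    buckets.reverse()
    return buckets, best[num_buckets][n]

buckets, error = v_optimal_ff([10, 2, 50, 1, 11, 2], 3)
print(buckets)  # → [[1, 2, 2], [10, 11], [50]]
```

The triple loop costs O(β·n²) for n distinct values and β buckets, which hints at why construction cost is a concern for this class.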

Performance

Construction cost

Complexity in storage and usage

Max-diff and compressed as partition constraint

Max-diff
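A minimal sketch of the Max-diff partition constraint, assuming values are sorted by attribute value and frequency is the source parameter: bucket boundaries are placed between the adjacent values whose frequencies differ the most. The function name and sample data are hypothetical.

```python
def build_max_diff(distribution, num_buckets):
    """Place the num_buckets - 1 boundaries at the largest differences
    in frequency between values adjacent in value order."""
    pairs = sorted(distribution)             # sort by attribute value
    diffs = [abs(pairs[i + 1][1] - pairs[i][1])
             for i in range(len(pairs) - 1)]
    # positions of the largest differences, then restored to value order
    largest = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    cut_after = sorted(largest[:num_buckets - 1])
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(pairs[start:i + 1])   # boundary falls after pairs[i]
        start = i + 1
    buckets.append(pairs[start:])
    return buckets

dist = [(1, 10), (2, 11), (3, 50), (4, 52), (5, 5)]
for bucket in build_max_diff(dist, 3):
    print(bucket)
# → [(1, 10), (2, 11)]
# → [(3, 50), (4, 52)]
# → [(5, 5)]
```

Construction needs only one sort and one pass over adjacent differences, so it is much cheaper than searching over all partitions.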
