Introduction to Histograms Presented By



Yüklə 464 b.
tarix06.05.2018
ölçüsü464 b.
#43141


Introduction to Histograms

  • Presented By:

  • Laukik Chitnis

  • (lchitnis@cise.ufl.edu)


This presentation is mostly based on the following work done by Yannis Ioannidis and Viswanath Poosala {yannis,viswanath}@cs.wisc.edu

  • This presentation is mostly based on the following work done by Yannis Ioannidis and Viswanath Poosala {yannis,viswanath}@cs.wisc.edu

    • Histogram Based Solutions to Diverse Database Estimation Problems (1995)
    • Improved Histograms for Selectivity Estimation of Range Predicates (1996) (with IBM Almaden research guys Peter Haas and Eugene Shekita)
    • History of Histograms (presented by Yannis)


Motivation and Expected Features

  • Why Histograms?

    • Size estimates needed for
      • Estimating the costs of access plans
    • Query profiling for user feedback
    • Load balancing for parallel join execution
  • Expected features

    • Should produce estimates with small errors
    • Inexpensive to construct, use and maintain
    • Utility across diverse estimation problems


Histograms as approximations of data distribution

  • Histograms as approximations of data distribution

  • Data distribution is a set of (attribute value, frequency) pairs



Set of (attribute value, frequency) pairs

  • Set of (attribute value, frequency) pairs

  • This data distribution has all the information required to answer query (count, join, aggregate,..)

  • But, it is too bulky!

  • So, we “approximate” it to a histogram!









equi-width histograms

  • equi-width histograms

    • A histogram type wherein the ‘spread’ of each bucket is same.
  • Spread S and Area A

    • Si = Vi+1 – Vi
    • Ai = fi * Si
    • where Vi is an attribute value fi is its frequency


Spread S and Area A



Divide the value axis into buckets of equal ‘width’ (range equalized)

  • Divide the value axis into buckets of equal ‘width’ (range equalized)

  • Advantages:

    • natural order of the attribute-values is preserved.
    • Light on storage requirements
  • Example: Count of tuples having x<5

  • Another example: What if the query range boundary does not match the bucket boundary?

  • Scale the last bucket!

  • Assumption: uniform distribution within a bucket!

  • Disadvantages: High variance!

    • Difficult to estimate errors
    • example: Our previous graph was ‘continuous’. What if we get crazy graphs? Big guys besides short guys!
    • Consider self-join with this histogram


Don’t equalize ranges of values but number of tuples in bucket

  • Don’t equalize ranges of values but number of tuples in bucket

  • equi-depth histograms

  • Disadvantages:

  • Work well for range queries only when the data distribution has low skew



Minimize variance – group the items having similar frequencies in a bucket

  • Minimize variance – group the items having similar frequencies in a bucket





The frequencies of the attribute values associated with each bucket are either all greater than or all less than the frequencies of the attribute values associated with other bucket

  • The frequencies of the attribute values associated with each bucket are either all greater than or all less than the frequencies of the attribute values associated with other bucket

  • Advantage: Optimal for reducing errors in estimation

  • Disadvantage: Storage requirement high

    • No order-correlation between values and frequencies  index required leading to approximate frequency of every individual attribute value
    • Optimal for equality joins and selection only when attribute value list is maintained for each bucket
  • Can we reduce the storage requirement?



Some of the highest frequencies and some lowest frequencies are explicitly and accurately maintained in separate individual buckets

  • Some of the highest frequencies and some lowest frequencies are explicitly and accurately maintained in separate individual buckets

  • Remaining (middle) frequencies are all approximated together in a single bucket

  • Storage requirement: very little and no index required

    • Only store attribute values in individual buckets and their corresponding frequencies


Definition

  • Definition

    • A histogram on attribute is constructed by partitioning the data distribution D into mutually disjoint β subsets called buckets and approximating the frequencies f and values V in each bucket in some common fashion.


that characterize histograms and determine their effectiveness in query result size estimation.

  • that characterize histograms and determine their effectiveness in query result size estimation.

    • Partition class
    • Sort parameter
    • Partition constraint
    • Source parameter
  • These properties are mutually orthogonal and form the basis for a general taxonomy of histograms.



Partition Class

  • Partition Class

    • Indicates restrictions on partitioning
      • Serial: non-overlapping ranges of sort parameter values
      • End-biased: at most one non-singleton bucket
  • Partition Constraint

    • The mathematical constraint that uniquely identifies the histogram within its partition class.
      • Equi-depth: sum of number of tuples in a bucket should be equal


Sort parameter and Source parameter

  • Sort parameter and Source parameter

    • Derivative of data distribution element (its value and/or frequency)
    • Serial: buckets must contain contiguous sort parameter values


Approximation of values within a bucket

  • Approximation of values within a bucket

    • The assumptions that determine the approximate values within a bucket
      • Uniform distribution of values within a bucket
  • Approximation of frequency of a value within a bucket

    • assumptions that determine the approximate frequency of each value within a bucket
      • Frequency of each value assumed to be arithmetic mean




Trivial Histogram

  • Trivial Histogram

  • Equi-sum(V,S) alias Equi-width

  • Equi-sum(V,F) alias Equi-depth

  • V-optimal (F,F)

  • V-optimal-end-biased (F,F)

  • Spline-based(V,C)



Trivial Histogram

  • Trivial Histogram

    • All values in a single bucket
  • Equi-sum(V,S) alias Equi-width

    • group contiguous ranges of attribute values into buckets, and the sum of the spreads in each bucket (i.e., the maximum minus the minimum value in the bucket) is approximately equal to times 1/ β times the domain range


Equi-sum(V,F) alias Equi-depth

  • Equi-sum(V,F) alias Equi-depth

    • are like equi-width histograms but have the sum of the frequencies in each bucket be equal rather than the sum of the spreads
  • V-optimal(F,F) histograms

    • group contiguous sets of frequencies into buckets so as to minimize the variance of the overall frequency approximation.
    • Were simply called serial histograms before




Performance

  • Performance

    • Frequency-based much better than traditional
    • End-based almost as good as serial
  • Construction cost

    • Serial histograms are exponential in the number of buckets whereas end-based are almost linear
  • Complexity in storage and usage

    • Serial has much higher complexity than end-based


Max-diff and compressed as partition constraint

  • Max-diff and compressed as partition constraint



Max-diff

  • Max-diff

    • a bucket boundary between two source parameter values that are adjacent (in sort parameter order) if the difference between these values is one of the β -1 largest such differences.
    • Goal is to avoid grouping attribute values with vastly different source parameter values into a bucket
    • Efficiently constructed by first computing differences between adjacent* source parameters
  • Compressed

    • Highest source values are stored separately in singleton buckets; the rest are partitioned as in an equi-sum histogram.
    • Achieve great accuracy in approximating skewed frequency distributions and/or non-uniform spreads






Yüklə 464 b.

Dostları ilə paylaş:




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©www.genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə