Title: | Wicked Fast, Accurate Quantiles Using t-Digests |
---|---|
Description: | The t-Digest construction algorithm, by Dunning et al. (2019) <doi:10.48550/arXiv.1902.04023>, uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-Digest data structure can be used to estimate quantiles, compute other rank statistics, or even estimate related measures such as trimmed means. The advantage of the t-Digest over previous digests for this purpose is that the t-Digest handles data with full floating point resolution. Quantile estimates produced by t-Digests can be orders of magnitude more accurate than those produced by previous digest algorithms. Methods are provided to create and update t-Digests and retrieve quantiles from the accumulated distributions. |
Authors: | Bob Rudis [aut, cre] , Ted Dunning [aut] (t-Digest algorithm; <https://github.com/tdunning/t-digest/>), Andrew Werner [aut] (Original C+ code; <https://github.com/ajwerner/tdigest>) |
Maintainer: | Bob Rudis <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.2 |
Built: | 2024-10-17 05:15:49 UTC |
Source: | https://github.com/hrbrmstr/tdigest |
These functions make it possible to create and populate a tdigest, serialize it, read it back in at a later time, and continue populating it, enabling compact distribution accumulation and storage for large, "continuous" datasets.
## S3 method for class 'tdigest'
as.list(x, ...)

as_tdigest(x)
x | a tdigest object or a tdigest_list object |
... | unused |
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
as_tdigest(as.list(td))
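The serialize-and-restore workflow described for these functions can be sketched as follows. This is a minimal sketch, assuming the list produced by as.list() passes through base R's saveRDS() and readRDS() unchanged; the temporary file path and the value 42 are illustrative choices, not part of the package.

```r
library(tdigest)

set.seed(1492)
x <- sample(0:100, 1000, replace = TRUE)
td <- tdigest(x, 100)

# Convert the digest to a plain list and write it to disk
# (the temp file is only an illustrative location)
path <- tempfile(fileext = ".rds")
saveRDS(as.list(td), path)

# Later: read the list back, rebuild the digest, and keep adding data
td2 <- as_tdigest(readRDS(path))
td_add(td2, 42, 1)

td_total_count(td)   # items in the original digest
td_total_count(td2)  # restored digest plus the one added value
```

Storing the list form rather than the tdigest object itself is the route the documentation describes; the restored digest can then be updated like any freshly created one.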
Add a value to the t-Digest with the specified count
td_add(td, val, count)
td | t-Digest object |
val | value to add |
count | count (number of occurrences of val to add) |
the original, updated tdigest object
td <- td_create(10)
td_add(td, 0, 1)
Allocate a new histogram
td_create(compression = 100)

is_tdigest(td)
compression | the input compression value; should be >= 1.0. This controls how aggressively the t-Digest compresses data together. The original t-Digest paper suggests using a value of 100 for a good balance between precision and efficiency. At that setting, errors at the extreme points of the distribution are very small (on the order of 1e-6 percentile points), and compression ratios are around 500 for large data sets (~1 million data points). Defaults to 100. |
td | t-Digest object |
a tdigest object
Computing Extremely Accurate Quantiles Using t-Digests
td <- td_create(10)
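To get a feel for how the compression argument trades accuracy for size, the sketch below builds two digests from the same data and compares their estimates with exact sample quantiles. The compression values (10 and 1000), the normal test data, and the chosen probabilities are arbitrary assumptions for illustration; the resulting errors depend on the data.

```r
library(tdigest)

set.seed(1492)
x <- rnorm(100000)

# Same data, two compression settings (arbitrary illustrative values)
td_low  <- tdigest(x, 10)
td_high <- tdigest(x, 1000)

probs <- c(0.001, 0.5, 0.999)
exact <- unname(quantile(x, probs))

# Absolute error of each digest's estimates against the exact sample quantiles
err_low  <- abs(tquantile(td_low, probs) - exact)
err_high <- abs(tquantile(td_high, probs) - exact)

data.frame(probs, err_low, err_high)
```

Comparing the two error columns on data that resembles your own is one way to decide whether the default of 100 is sufficient.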
Merge one t-Digest into another
td_merge(from, into)
from, into | t-Digests |
into (a tdigest object)
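A typical use of merging is combining digests that were built separately, for example over two shards of a data set. The sketch below is a minimal illustration under that assumption; the shard sizes, the uniform test data, and the variable names are hypothetical.

```r
library(tdigest)

set.seed(1492)
# Two digests built independently, e.g. from two data shards
td_a <- tdigest(runif(5000), 100)
td_b <- tdigest(runif(5000), 100)

# Fold td_a into td_b; td_b is the digest that gets updated and returned
td_merge(td_a, td_b)

td_total_count(td_b)    # reflects the items from both digests
td_value_at(td_b, 0.5)  # median estimate over the combined data
```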
Return the quantile of the value
td_quantile_of(td, val)
td | t-Digest object |
val | value |
the computed quantile (double)
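As a rough illustration, td_quantile_of() can be read as the inverse of td_value_at(): it takes a value and estimates where that value falls in the accumulated distribution. The sketch below assumes uniform test data and uses base R's ecdf() only as a point of comparison; no exact outputs are claimed.

```r
library(tdigest)

set.seed(1492)
x <- runif(100000)
td <- tdigest(x, 100)

# Estimated quantile of the value 0.25 in the accumulated distribution
q <- td_quantile_of(td, 0.25)
q

# Empirical CDF from base R, for comparison
ecdf(x)(0.25)

# Mapping the quantile back to a value should land near 0.25
td_value_at(td, q)
```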
Total items contained in the t-Digest
td_total_count(td)

## S3 method for class 'tdigest'
length(x)
td | t-Digest object |
x | a tdigest object |
a double containing the total number of items in the t-Digest
td <- td_create(10)
td_add(td, 0, 1)
td_total_count(td)
length(td)
Return the value at the specified quantile
td_value_at(td, q)

## S3 method for class 'tdigest'
x[i, ...]
td | t-Digest object |
q | quantile (range 0:1) |
x | a tdigest object |
i | quantile (range 0:1) |
... | unused |
the value at the requested quantile (double)
td <- td_create(10)
td_add(td, 0, 1) %>%
  td_add(10, 1)
td_value_at(td, 0.1)
td_value_at(td, 0.5)
td[0.1]
td[0.5]
Calculate sample quantiles from a t-Digest
tquantile(td, probs)

## S3 method for class 'tdigest'
quantile(x, probs = seq(0, 1, 0.25), ...)
td | t-Digest object |
probs | numeric vector of probabilities with values in range 0:1 |
x | a tdigest object whose quantiles are wanted |
... | unused |
a numeric vector containing the requested quantile values
Computing Extremely Accurate Quantiles Using t-Digests
set.seed(1492)
x <- sample(0:100, 1000000, replace = TRUE)
td <- tdigest(x, 1000)
tquantile(td, c(0, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 1))
quantile(td)