Dedupe events Transform
The Vector dedupe transform deduplicates events to reduce data volume by eliminating copies of data.
Configuration
    [transforms.my_transform_id]
    # General
    type = "dedupe" # required
    inputs = ["my-source-or-transform-id"] # required

    # Fields
    fields.match = ["timestamp", "host", "message"] # optional, default
- cache (optional, table)
  Options controlling how we cache recent Events for future duplicate checking.
  - cache.num_events (optional, uint)
    The number of recent Events to cache and compare new incoming Events against.
    Default: 5000
- fields (required, table)
  Options controlling what fields to match against.
  - fields.ignore (optional, [string])
    The field names to ignore when deciding if an Event is a duplicate. Incompatible with the fields.match option.
  - fields.match (optional, [string])
    The field names considered when deciding if an Event is a duplicate. This can also be set globally via the global log_schema options. Incompatible with the fields.ignore option.
    Default: ["timestamp","host","message"]
Telemetry
This component provides the following metrics that can be retrieved through the internal_metrics source. See the metrics section in the monitoring page for more info.
- events_discarded_total (counter)
  The total number of events discarded by this component. This metric includes the following tags:
  - instance - The Vector instance identified by host and port.
  - job - The name of the job producing Vector metrics.
- processed_events_total (counter)
  The total number of events processed by this component. This metric includes the following tags:
  - component_kind - The Vector component kind.
  - component_name - The Vector component ID.
  - component_type - The Vector component type.
  - file - The file that produced the error.
  - instance - The Vector instance identified by host and port.
  - job - The name of the job producing Vector metrics.
- processed_bytes_total (counter)
  The total number of bytes processed by the component. This metric includes the following tags:
  - component_kind - The Vector component kind.
  - component_name - The Vector component ID.
  - component_type - The Vector component type.
  - instance - The Vector instance identified by host and port.
  - job - The name of the job producing Vector metrics.
How It Works
Cache Behavior
This transform is backed by an LRU cache of size cache.num_events. This means that the transform caches information in memory for the last cache.num_events Events that it has processed. Entries are evicted from the cache in the order they were inserted, oldest first. If an incoming Event is considered a duplicate of an Event already in the cache, the cached entry is moved back to the head of the cache, which resets its place in line and makes it once again the last entry in line to be evicted.
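The eviction order described above can be modeled with a short Python sketch. This is an illustrative toy, not Vector's implementation: the `DedupeCache` class, its `is_duplicate` method, and the single-letter keys standing in for matched field values are all hypothetical names chosen for the example.

```python
from collections import OrderedDict


class DedupeCache:
    """Toy model of the cache behavior described above (not Vector's code)."""

    def __init__(self, num_events):
        self.num_events = num_events
        self.entries = OrderedDict()  # first-to-evict entry first, newest last

    def is_duplicate(self, key):
        """Return True if key is already cached; otherwise cache it and return False."""
        if key in self.entries:
            # A duplicate refreshes the entry, so it becomes the last to be evicted.
            self.entries.move_to_end(key)
            return True
        self.entries[key] = None
        if len(self.entries) > self.num_events:
            # Evict the entry that has gone longest without being inserted or matched.
            self.entries.popitem(last=False)
        return False


cache = DedupeCache(num_events=2)
results = [cache.is_duplicate(k) for k in ["a", "b", "a", "c", "a", "b"]]
print(results)  # [False, False, True, False, True, False]
```

In this run the second "a" refreshes its entry, so inserting "c" evicts "b" instead. The final "b" is therefore treated as a new event even though it was seen earlier.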
Memory Usage Details
Each entry in the cache corresponds to an incoming Event and contains a copy of the 'value' data for all fields in the Event being considered for matching. When using fields.match, this will be the list of fields specified in that configuration option. When using fields.ignore, that will include all fields present in the incoming event except those specified in fields.ignore. Each entry also uses a single byte per field to store the type information of that field. When using fields.ignore, each cache entry additionally stores a copy of each field name being considered for matching. When using fields.match, storing the field names is not necessary.
Memory Utilization Estimation
If you want to estimate the memory requirements of this transform for your dataset, you can do so with these formulas:
When using fields.match:
Sum(the average size of the *data* (but not including the field name) for each field in `fields.match`) * `cache.num_events`
When using fields.ignore:
((the average size of each incoming Event) - Sum(the average size of the field name *and* value for each field in `fields.ignore`)) * `cache.num_events`
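As a rough worked example, the two estimates can be computed as below. The function names and all byte sizes are hypothetical, and the fields.ignore case is read here as the average event size minus the summed name-and-value sizes of the ignored fields. Like the formulas, the sketch leaves out the one byte of type information per field.

```python
def estimate_match_bytes(avg_value_sizes, num_events):
    """fields.match: summed average value sizes (field names excluded) times cache size."""
    return sum(avg_value_sizes.values()) * num_events


def estimate_ignore_bytes(avg_event_size, avg_ignored_sizes, num_events):
    """fields.ignore: average event size minus ignored names and values, times cache size."""
    return (avg_event_size - sum(avg_ignored_sizes.values())) * num_events


# Hypothetical average sizes in bytes, with the default cache.num_events of 5000.
match_estimate = estimate_match_bytes(
    {"timestamp": 8, "host": 20, "message": 200}, num_events=5000
)
ignore_estimate = estimate_ignore_bytes(
    avg_event_size=400, avg_ignored_sizes={"timestamp": 17}, num_events=5000
)

print(match_estimate)   # 1140000
print(ignore_estimate)  # 1915000
```

With these made-up averages, matching on three fields needs roughly 1.1 MB of cached data, while ignoring only the timestamp needs roughly 1.9 MB, since nearly the whole event is retained.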
Missing Fields
A field with an explicit null value is always considered different from the same field being omitted entirely. For example, if you run this transform with fields.match = ["a"], the event "{a: null, b:5}" will be considered different from the event "{b:5}".
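One way to picture this rule is a comparison key that records absence separately from null. The sketch below is only an illustration of the behavior, not how Vector builds its cache entries. `MISSING`, `match_key`, and `naive_key` are hypothetical names, and Python's `None` stands in for null.

```python
MISSING = object()  # sentinel meaning "field is absent from the event"


def match_key(event, fields):
    """Build a comparison key that keeps null and missing fields distinct."""
    return tuple(event.get(name, MISSING) for name in fields)


def naive_key(event, fields):
    """Contrast case: a key that collapses missing fields into None."""
    return tuple(event.get(name) for name in fields)


with_null = {"a": None, "b": 5}
without_a = {"b": 5}

# With the sentinel, the two events differ on field "a", as the text describes.
print(match_key(with_null, ["a"]) == match_key(without_a, ["a"]))  # False
# The naive key would wrongly treat them as duplicates.
print(naive_key(with_null, ["a"]) == naive_key(without_a, ["a"]))  # True
```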