Dedupe events
Deduplicate logs passing through a topology
Configuration
Example configurations
{
  "transforms": {
    "my_transform_id": {
      "type": "dedupe",
      "inputs": [
        "my-source-or-transform-id"
      ]
    }
  }
}

[transforms.my_transform_id]
type = "dedupe"
inputs = [ "my-source-or-transform-id" ]

transforms:
  my_transform_id:
    type: dedupe
    inputs:
      - my-source-or-transform-id
cache
optional object

cache.num_events
optional uint
default: 5000

fields
optional object
Options to control what fields to match against.
When no field matching configuration is specified, events are matched using the timestamp, host, and message fields from an event. The specific field names used are those set in the global log schema configuration.

fields.ignore
required [string]

fields.match
required [string]

graph
optional object
Extra graph configuration
Configure output for component when generated with graph command

graph.node_attributes
optional object
Node attributes to add to this component’s node in resulting graph
They are added to the node as provided

graph.node_attributes.*
required string literal

inputs
required [string]
A list of upstream source or transform IDs.
Wildcards (*) are supported.
See configuration for more info.
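The field-selection rules described for the fields option can be illustrated with a small Python model. This is only a sketch, not Vector's implementation: the function name comparison_key is hypothetical, the default field names stand in for whatever the global log schema defines, events are modeled as flat dictionaries with hashable values, and the sketch assumes that at most one of match and ignore is set.

```python
# Assumption: these names stand in for the fields set in the global log schema.
DEFAULT_MATCH_FIELDS = ("timestamp", "host", "message")


def comparison_key(event, match=None, ignore=None):
    """Build a hashable key from the fields that take part in matching.

    Hypothetical helper for illustration; `event` is a flat dict.
    """
    if ignore is not None:
        ignored = set(ignore)
        # With fields.ignore, every remaining field is compared, so the
        # field names are kept alongside the values.
        pairs = [(name, value) for name, value in event.items() if name not in ignored]
        return tuple(sorted(pairs, key=lambda pair: pair[0]))
    # With fields.match (or the defaults), the field list is fixed,
    # so only the values need to be kept.
    names = DEFAULT_MATCH_FIELDS if match is None else tuple(match)
    return tuple(event.get(name) for name in names)


a = {"timestamp": "t1", "host": "h", "message": "m", "pid": 1}
b = {"timestamp": "t1", "host": "h", "message": "m", "pid": 2}

print(comparison_key(a) == comparison_key(b))                                  # defaults ignore pid
print(comparison_key(a, match=["pid"]) == comparison_key(b, match=["pid"]))    # pid differs
print(comparison_key(a, ignore=["pid"]) == comparison_key(b, ignore=["pid"]))  # pid excluded
```

Two events count as duplicates in this model when their keys are equal, so the choice of fields directly controls how aggressive deduplication is.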
Outputs
<component_id>
Telemetry
Metrics
component_discarded_events_total
counter
[...] filter transform, or false if due to an error.

component_errors_total
counter

component_received_event_bytes_total
counter

component_received_events_count
histogram
A histogram of the number of events passed in each internal batch in Vector’s internal topology.
Note that this is separate from sink-level batching. It is mostly useful for low-level debugging of performance issues in Vector caused by small internal batches.

component_received_events_total
counter

component_sent_event_bytes_total
counter

component_sent_events_total
counter

utilization
gauge

How it works
Cache Behavior
This transform is backed by an LRU (least recently used) cache of size cache.num_events. That means that this transform caches information in memory for the last cache.num_events Events that it has processed. Entries are removed from the cache in the order they were inserted. If an Event is received that is considered a duplicate of an Event already in the cache, the cached entry is moved back to the head of the cache and its place in line is reset, making it once again the last entry in line to be evicted.

Memory Usage Details
Each entry in the cache corresponds to an incoming Event and contains a copy of the value of every field being considered for matching. When using fields.match, this will be the list of fields specified in that configuration option. When using fields.ignore, that will include all fields present in the incoming event except those specified in fields.ignore. Each entry also uses a single byte per field to store the type information of that field. When using fields.ignore, each cache entry additionally stores a copy of each field name being considered for matching. When using fields.match, storing the field names is not necessary.

Memory Utilization Estimation
If you want to estimate the memory requirements of this transform for your dataset, you can do so with these formulas:
When using fields.match:
Sum(the average size of the *data* (but not including the field name) for each field in `fields.match`) * `cache.num_events`
When using fields.ignore:
((the average size of each incoming Event) - Sum(the average size of the field name *and* value for each field in `fields.ignore`)) * `cache.num_events`
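To make the cache behavior and the two estimates concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not Vector's implementation: the class DedupeCache and the two estimate functions are hypothetical names, events are flat dictionaries, only the fields.match style of keying is modeled, and the byte sizes in the usage lines are made-up sample averages.

```python
from collections import OrderedDict


class DedupeCache:
    """Toy model of the eviction behavior described under Cache Behavior."""

    def __init__(self, num_events=5000, match_fields=("timestamp", "host", "message")):
        self.num_events = num_events          # mirrors cache.num_events
        self.match_fields = tuple(match_fields)
        self._entries = OrderedDict()         # oldest entry first

    def is_duplicate(self, event):
        key = tuple(event.get(name) for name in self.match_fields)
        if key in self._entries:
            # A duplicate resets the entry's place in line.
            self._entries.move_to_end(key)
            return True
        self._entries[key] = None
        if len(self._entries) > self.num_events:
            # Evict the entry that has gone longest without a match.
            self._entries.popitem(last=False)
        return False


def estimate_match_bytes(avg_value_sizes, num_events):
    """fields.match: sum of average value sizes times cache.num_events."""
    return sum(avg_value_sizes.values()) * num_events


def estimate_ignore_bytes(avg_event_size, avg_ignored_sizes, num_events):
    """fields.ignore: (average event size - ignored name+value sizes) times cache.num_events."""
    return (avg_event_size - sum(avg_ignored_sizes.values())) * num_events


cache = DedupeCache(num_events=2, match_fields=("message",))
for text in ["a", "b", "a", "c", "b"]:
    print(text, cache.is_duplicate({"message": text}))

# Sample averages in bytes (illustrative values only).
print(estimate_match_bytes({"host": 20, "message": 180}, 5000))
print(estimate_ignore_bytes(512, {"timestamp": 40}, 5000))
```

In the loop, the second "a" refreshes its entry, so inserting "c" evicts "b" rather than "a", and the final "b" is treated as new. Both estimates leave out the one byte per field of type information, so they are lower bounds on real usage.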