Dedupe events
Deduplicate logs passing through a topology
status: stable
egress: stream
state: stateful
Deduplicates events to reduce data volume by eliminating copies of data.
Configuration
Example configurations
{
  "transforms": {
    "my_transform_id": {
      "type": "dedupe",
      "inputs": [
        "my-source-or-transform-id"
      ],
      "fields": null
    }
  }
}
[transforms.my_transform_id]
type = "dedupe"
inputs = [ "my-source-or-transform-id" ]
---
transforms:
  my_transform_id:
    type: dedupe
    inputs:
      - my-source-or-transform-id
    fields: null
{
  "transforms": {
    "my_transform_id": {
      "type": "dedupe",
      "inputs": [
        "my-source-or-transform-id"
      ],
      "cache": null,
      "fields": null
    }
  }
}
[transforms.my_transform_id]
type = "dedupe"
inputs = [ "my-source-or-transform-id" ]
---
transforms:
  my_transform_id:
    type: dedupe
    inputs:
      - my-source-or-transform-id
    cache: null
    fields: null
cache
optional object
Options controlling how recent Events are cached for future duplicate checking.
cache.num_events
optional uint
The number of recent Events to cache and compare new incoming Events against.
default: 5000
fields
required object
Options controlling which fields to match against.
fields.ignore
optional [string]
The field names to ignore when deciding if an Event is a duplicate. Incompatible with the fields.match option.
fields.match
optional [string]
The field names considered when deciding if an Event is a duplicate. This can also be set globally via the global log_schema options. Incompatible with the fields.ignore option.
default: ["timestamp", "host", "message"]
inputs
required [string]
A list of upstream source or transform IDs. Wildcards (*) are supported.
See configuration for more info.
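The difference between fields.match and fields.ignore can be illustrated with a small Python sketch. This is not Vector's implementation: the function name comparison_key, the use of repr to make values hashable, and the treatment of a missing field as None are all assumptions made for illustration. Only the default field list and the rule that the two options are incompatible come from the option reference above.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple

# Default taken from the fields.match option described above.
DEFAULT_MATCH: Tuple[str, ...] = ("timestamp", "host", "message")


def comparison_key(
    event: Mapping[str, Any],
    match: Optional[Sequence[str]] = None,
    ignore: Optional[Sequence[str]] = None,
) -> Tuple[Tuple[str, str], ...]:
    """Build the value two events must share to count as duplicates.

    Hypothetical helper: it only models which fields take part in the
    comparison, not how Vector stores or compares them internally.
    """
    if match is not None and ignore is not None:
        raise ValueError("fields.match and fields.ignore are incompatible")

    if ignore is not None:
        ignored = set(ignore)
        # Every field except the ignored ones takes part in the comparison.
        # Items are sorted so that field order in the event does not matter.
        return tuple(
            sorted((name, repr(value)) for name, value in event.items()
                   if name not in ignored)
        )

    names = DEFAULT_MATCH if match is None else tuple(match)
    # Only the listed fields take part; a missing field is treated as None
    # (an assumption of this sketch).
    return tuple((name, repr(event.get(name))) for name in names)
```

With the default settings, two events that differ only in a field outside timestamp, host, and message produce the same key in this sketch and would therefore be treated as duplicates.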
Outputs
<component_id>
Default output stream of the component. Use this component’s ID as an input to downstream transforms and sinks.
Telemetry
Metrics
component_received_event_bytes_total
counter
The number of event bytes accepted by this component either from tagged origins like file and uri, or cumulatively from other origins.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
container_name
optional
The name of the container from which the data originated.
file
optional
The file from which the data originated.
host
optional
The hostname of the system Vector is running on.
mode
optional
The connection mode used by the component.
peer_addr
optional
The IP from which the data originated.
peer_path
optional
The pathname from which the data originated.
pid
optional
The process ID of the Vector instance.
pod_name
optional
The name of the pod from which the data originated.
uri
optional
The sanitized URI from which the data originated.
component_received_events_total
counter
The number of events accepted by this component either from tagged origins like file and uri, or cumulatively from other origins.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
container_name
optional
The name of the container from which the data originated.
file
optional
The file from which the data originated.
host
optional
The hostname of the system Vector is running on.
mode
optional
The connection mode used by the component.
peer_addr
optional
The IP from which the data originated.
peer_path
optional
The pathname from which the data originated.
pid
optional
The process ID of the Vector instance.
pod_name
optional
The name of the pod from which the data originated.
uri
optional
The sanitized URI from which the data originated.
component_sent_event_bytes_total
counter
The total number of event bytes emitted by this component.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
host
optional
The hostname of the system Vector is running on.
output
optional
The specific output of the component.
pid
optional
The process ID of the Vector instance.
component_sent_events_total
counter
The total number of events emitted by this component.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
host
optional
The hostname of the system Vector is running on.
output
optional
The specific output of the component.
pid
optional
The process ID of the Vector instance.
events_discarded_total
counter
The total number of events discarded by this component.
host
optional
The hostname of the system Vector is running on.
pid
optional
The process ID of the Vector instance.
reason
required
The type of the error.
events_in_total
counter
The number of events accepted by this component either from tagged origins like file and uri, or cumulatively from other origins. This metric is deprecated and will be removed in a future version. Use component_received_events_total instead.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
container_name
optional
The name of the container from which the data originated.
file
optional
The file from which the data originated.
host
optional
The hostname of the system Vector is running on.
mode
optional
The connection mode used by the component.
peer_addr
optional
The IP from which the data originated.
peer_path
optional
The pathname from which the data originated.
pid
optional
The process ID of the Vector instance.
pod_name
optional
The name of the pod from which the data originated.
uri
optional
The sanitized URI from which the data originated.
events_out_total
counter
The total number of events emitted by this component. This metric is deprecated and will be removed in a future version. Use component_sent_events_total instead.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
host
optional
The hostname of the system Vector is running on.
output
optional
The specific output of the component.
pid
optional
The process ID of the Vector instance.
processed_bytes_total
counter
The number of bytes processed by the component.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
container_name
optional
The name of the container from which the bytes originate.
file
optional
The file from which the bytes originate.
host
optional
The hostname of the system Vector is running on.
mode
optional
The connection mode used by the component.
peer_addr
optional
The IP from which the bytes originate.
peer_path
optional
The pathname from which the bytes originate.
pid
optional
The process ID of the Vector instance.
pod_name
optional
The name of the pod from which the bytes originate.
uri
optional
The sanitized URI from which the bytes originate.
processed_events_total
counter
The total number of events processed by this component. This metric is deprecated in favor of the component_received_events_total and component_sent_events_total metrics.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
host
optional
The hostname of the system Vector is running on.
pid
optional
The process ID of the Vector instance.
utilization
gauge
A ratio from 0 to 1 of the load on a component. A value of 0 indicates a completely idle component that is simply waiting for input. A value of 1 indicates a component that is never idle. This value is updated every 5 seconds.
component_id
required
The Vector component ID.
component_kind
required
The Vector component kind.
component_name
required
Deprecated, use component_id instead. The value is the same as component_id.
component_type
required
The Vector component type.
host
optional
The hostname of the system Vector is running on.
pid
optional
The process ID of the Vector instance.
How it works
Cache Behavior
This transform is backed by an LRU cache of size cache.num_events. This means the transform caches information in memory for the last cache.num_events Events that it has processed. Entries are removed from the cache in the order they were inserted. If an incoming Event is considered a duplicate of an Event already in the cache, the cached entry is moved back to the head of the cache and its place in line is reset, making it once again the last entry in line to be evicted.
Memory Usage Details
Each entry in the cache corresponds to an incoming Event and contains a copy of the ‘value’ data for all fields in the Event being considered for matching. When using fields.match, this is the list of fields specified in that configuration option. When using fields.ignore, this includes all fields present in the incoming Event except those specified in fields.ignore. Each entry also uses a single byte per field to store the type information of that field. When using fields.ignore, each cache entry additionally stores a copy of each field name being considered for matching. When using fields.match, storing the field names is not necessary.
Memory Utilization Estimation
If you want to estimate the memory requirements of this transform for your dataset, you can do so with these formulas:
When using fields.match:
Sum(the average size of the *data* (not including the field name) for each field in `fields.match`) * `cache.num_events`
When using fields.ignore:
(the average size of each incoming Event - Sum(the average size of the field name *and* value for each field in `fields.ignore`)) * `cache.num_events`
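The cache behavior and the two estimation formulas in this section can be sketched in Python as follows. This is a simplified model under stated assumptions, not Vector's implementation: the names RecentEventCache, is_duplicate, estimate_match_bytes, and estimate_ignore_bytes are hypothetical, the cache stores an opaque hashable key instead of copies of field values, and the fields.ignore estimate reads the formula as the average Event size minus the summed size of the ignored fields. Like the formulas, the estimates leave out the one byte of type information per field.

```python
from collections import OrderedDict
from typing import Hashable, Mapping


class RecentEventCache:
    """Bounded cache of recently seen event keys with LRU eviction."""

    def __init__(self, num_events: int = 5000) -> None:
        if num_events < 1:
            raise ValueError("num_events must be at least 1")
        self.num_events = num_events
        # Oldest entry first, most recently seen entry last.
        self._entries: "OrderedDict[Hashable, None]" = OrderedDict()

    def is_duplicate(self, key: Hashable) -> bool:
        """Return True if key is already cached, refreshing its position."""
        if key in self._entries:
            # A duplicate resets the entry's place in line, so it becomes
            # the last entry to be evicted.
            self._entries.move_to_end(key)
            return True
        self._entries[key] = None
        if len(self._entries) > self.num_events:
            # Evict the entry that has gone longest without being seen.
            self._entries.popitem(last=False)
        return False


def estimate_match_bytes(avg_value_bytes: Mapping[str, float],
                         num_events: int = 5000) -> float:
    """fields.match: sum of average value sizes times cache.num_events."""
    return sum(avg_value_bytes.values()) * num_events


def estimate_ignore_bytes(avg_event_bytes: float,
                          avg_ignored_bytes: Mapping[str, float],
                          num_events: int = 5000) -> float:
    """fields.ignore: (average Event size minus ignored name and value
    sizes) times cache.num_events."""
    return (avg_event_bytes - sum(avg_ignored_bytes.values())) * num_events
```

As a usage example with made-up numbers: a cache of size 2 that sees the keys a, b, a, c, b reports only the second a as a duplicate, because the refresh of a makes b the oldest entry, so b is evicted when c arrives. For the estimate, average value sizes of 8, 16, and 120 bytes for timestamp, host, and message give 144 * 5000 = 720,000 bytes at the default cache.num_events.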