Tokenizer Transform

The Vector tokenizer transform accepts and outputs log events. It tokenizes a field's value by splitting on whitespace, ignoring special wrapping characters, and zipping the tokens into ordered field names.

Configuration

vector.toml
[transforms.my_transform_id]
# General
type = "tokenizer" # required
inputs = ["my-source-id"] # required
drop_field = true # optional, default
field = "message" # optional, default
field_names = ["timestamp", "level", "message", "parent.child"] # required
# Types
types.status = "int" # example
types.duration = "float" # example
types.success = "bool" # example
types.timestamp = "timestamp|%F" # example
types.timestamp = "timestamp|%a %b %e %T %Y" # example
types.parent.child = "int" # example
  • bool, common, optional

    drop_field

    If true the field will be dropped after parsing.

    • Default: true
  • string, common, optional

    field

    The log field to tokenize.

    See Field Notation Syntax for more info.

    • Default: "message"
  • [string], common, required

    field_names

    The log field names assigned to the resulting tokens, in order.

    See Field Notation Syntax for more info.

    • No default
  • table, common, optional

    types

    Key/value pairs representing mapped log field names and types. This is used to coerce log fields into their proper types.

    • string, enum, common, optional

      [field-name]

      A definition of log field type conversions. The key is the log field name and the value is the type. strptime specifiers are supported for the timestamp type.

      • No default
      • Enum, must be one of: "bool" "float" "int" "string" "timestamp"

Examples

Given the following log line:

log event
{
"message": "5.86.210.12 - zieme4647 [19/06/2019:17:20:49 -0400] \"GET /embrace/supply-chains/dynamic/vertical\" 201 20574"
}

And the following configuration:

vector.toml
[transforms.<transform-id>]
type = "tokenizer"
field = "message"
field_names = ["remote_addr", "ident", "user_id", "timestamp", "message", "status", "bytes"]

A log event will be output with the following structure:

log event
{
// ... existing fields
"remote_addr": "5.86.210.12",
"user_id": "zieme4647",
"timestamp": "19/06/2019:17:20:49 -0400",
"message": "GET /embrace/supply-chains/dynamic/vertical",
"status": "201",
"bytes": "20574"
}

A few things to note about the output:

  1. The message field was overwritten.
  2. The ident field was dropped since it contained a "-" value.
  3. All values are strings because this configuration does not set any types options.
  4. Special wrapper characters were dropped, such as wrapping [...] and "..." characters.

How It Works

Blank Values

Both " " and "-" are considered blank values, and their mapped fields will be set to null.
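
The zipping step and the blank-value rule above can be sketched in Python. This is only an illustration, not Vector's implementation: the names zip_tokens and BLANK_VALUES are hypothetical, the tokens are assumed to be already split, and surplus tokens or field names are simply ignored here.

```python
# Illustrative sketch only; zip_tokens and BLANK_VALUES are hypothetical names.
BLANK_VALUES = {" ", "-"}


def zip_tokens(tokens, field_names):
    """Pair tokens with field names in order; blank tokens map to None (null)."""
    event = {}
    # zip() stops at the shorter input, so extras on either side are ignored.
    for name, token in zip(field_names, tokens):
        event[name] = None if token in BLANK_VALUES else token
    return event


print(zip_tokens(["5.86.210.12", "-", "zieme4647"],
                 ["remote_addr", "ident", "user_id"]))
# {'remote_addr': '5.86.210.12', 'ident': None, 'user_id': 'zieme4647'}
```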

Complex Processing

If you encounter limitations with the tokenizer transform, we recommend using a runtime transform. These transforms are designed for complex processing and give you the power of a full programming runtime.

Environment Variables

Environment variables are supported throughout Vector's configuration. Add ${MY_ENV_VAR} to your Vector configuration file and the variable will be replaced before the configuration is evaluated.

You can learn more in the Environment Variables section.
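
As a rough mental model of this substitution, the following Python sketch replaces ${NAME} placeholders in a configuration string before it is parsed. The function name interpolate is hypothetical, and leaving unknown variables untouched is a choice made for this sketch, not a statement about how Vector handles unset variables.

```python
import re

# Illustrative sketch only; interpolate is a hypothetical helper.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(config_text, env):
    """Replace each ${NAME} with env[NAME]; unknown names are left as-is."""
    return PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), config_text)


print(interpolate('field = "${LOG_FIELD}"', {"LOG_FIELD": "message"}))
# field = "message"
```

In real use the mapping would typically be os.environ; a plain dict is passed here to keep the example deterministic.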

Field Notation Syntax

The field and field_names options support Vector's field notation syntax, enabling access to root-level, nested, and array field values. For example:

vector.toml
[transforms.my_tokenizer_transform_id]
# ...
field = "message" # a root-level field
field = "parent.child" # a nested field (alternative to the line above)
# ...

You can learn more about Vector's field notation in the field notation reference.
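
To illustrate what a nested name such as "parent.child" means for the resulting event, here is a small Python sketch that writes a value at a dot-separated path. The helper set_path is hypothetical, and the sketch covers only dot-separated nesting; array indexing and any escaping rules are left out.

```python
# Illustrative sketch only; set_path is a hypothetical helper.
def set_path(event, path, value):
    """Set a value at a dot-separated path, creating nested dicts as needed."""
    *parents, leaf = path.split(".")
    node = event
    for key in parents:
        # Reuse an existing nested dict or create an empty one.
        node = node.setdefault(key, {})
    node[leaf] = value
    return event


print(set_path({"message": "hi"}, "parent.child", "42"))
# {'message': 'hi', 'parent': {'child': '42'}}
```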

Special Characters

In order to extract raw values and remove wrapping characters, we must treat certain characters as special. These characters will be discarded:

  • "..." - Quotes are used to wrap phrases. Spaces are preserved, but the wrapping quotes will be discarded.
  • [...] - Brackets are used to wrap phrases. Spaces are preserved, but the wrapping brackets will be discarded.
  • \ - Can be used to escape the above characters so that Vector treats them as literal.
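
The rules above can be approximated with a short Python sketch. It is not Vector's tokenizer: the function name tokenize is hypothetical, and for simplicity a backslash here escapes whatever character follows it, and an unterminated wrapper runs to the end of the line.

```python
# Illustrative sketch only; tokenize is a hypothetical function.
WRAPPERS = {'"': '"', "[": "]"}


def tokenize(line):
    """Split on whitespace, keeping "..." and [...] phrases as single tokens."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1
            continue
        closer = WRAPPERS.get(line[i])
        if closer:
            i += 1  # discard the opening wrapper character
        buf = []
        while i < n:
            ch = line[i]
            if ch == "\\" and i + 1 < n:
                buf.append(line[i + 1])  # escaped character is kept literally
                i += 2
                continue
            if closer and ch == closer:
                i += 1  # discard the closing wrapper character
                break
            if not closer and ch.isspace():
                break
            buf.append(ch)
            i += 1
        tokens.append("".join(buf))
    return tokens


line = ('5.86.210.12 - zieme4647 [19/06/2019:17:20:49 -0400] '
        '"GET /embrace/supply-chains/dynamic/vertical" 201 20574')
print(tokenize(line))
```

Applied to the log line from the Examples section, the sketch yields seven tokens, with the bracketed timestamp and the quoted request each kept whole.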

Value Coercion

Values can be coerced upon extraction via the types.* options. This functions exactly like the coercer transform, except that it is built into this transform for convenience.
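
As a sketch of what coercion does to the string values produced by tokenizing, the Python below converts fields according to a types mapping. The coerce helper is hypothetical, the accepted boolean spellings are an assumption made for this sketch rather than Vector's documented list, and timestamps are omitted.

```python
# Illustrative sketch only; coerce is a hypothetical helper.
def coerce(value, type_name):
    """Convert a string token to the requested type (timestamps omitted)."""
    if type_name == "string":
        return value
    if type_name == "int":
        return int(value)
    if type_name == "float":
        return float(value)
    if type_name == "bool":
        # Assumption: these spellings are a guess made for this sketch.
        lowered = value.strip().lower()
        if lowered in ("true", "t", "yes", "y", "1"):
            return True
        if lowered in ("false", "f", "no", "n", "0"):
            return False
        raise ValueError("not a boolean: %r" % value)
    raise ValueError("unsupported type: %r" % type_name)


event = {"status": "201", "bytes": "20574", "user_id": "zieme4647"}
types = {"status": "int", "bytes": "int"}
# Fields without an entry in types stay as strings.
print({k: coerce(v, types.get(k, "string")) for k, v in event.items()})
# {'status': 201, 'bytes': 20574, 'user_id': 'zieme4647'}
```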

Timestamps

You can coerce values into timestamps via the timestamp type:

vector.toml
# ...
types.first_timestamp = "timestamp" # best effort parsing
types.second_timestamp = "timestamp|%Y-%m-%dT%H:%M:%S%z" # ISO8601
# ...

As noted above, if you do not specify a strptime format, Vector will make a best-effort attempt to parse the timestamp against the following common formats:

Format                  Description

Without Timezone
%F %T                   YYYY-MM-DD HH:MM:SS
%v %T                   DD-Mmm-YYYY HH:MM:SS
%FT%T                   ISO 8601 / RFC 3339 without TZ
%m/%d/%Y:%T             US common date format
%a, %d %b %Y %T         RFC 822/2822 without TZ
%a %d %b %T %Y          date command output without TZ
%A %d %B %T %Y          date command output without TZ, long names
%a %b %e %T %Y          ctime format

With Timezone
%+                      ISO 8601 / RFC 3339
%a %d %b %T %Z %Y       date command output
%a %d %b %T %z %Y       date command output, numeric TZ
%a %d %b %T %#z %Y      date command output, numeric TZ

UTC Formats
%s                      UNIX timestamp
%FT%TZ                  ISO 8601 / RFC 3339 UTC
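
The best-effort behavior can be pictured as trying each known format in turn until one parses. The Python sketch below does this for a handful of the formats listed, written with Python's own strptime codes because Python does not support shorthand specifiers such as %F, %T, or %+. The names parse_timestamp and COMMON_FORMATS are hypothetical, and returning None on failure is a choice made for this sketch.

```python
from datetime import datetime

# Illustrative sketch only; a small subset of the formats, expanded for Python.
COMMON_FORMATS = [
    "%Y-%m-%d %H:%M:%S",    # equivalent of %F %T
    "%Y-%m-%dT%H:%M:%S",    # equivalent of %FT%T
    "%m/%d/%Y:%H:%M:%S",    # equivalent of %m/%d/%Y:%T
    "%Y-%m-%dT%H:%M:%S%z",  # ISO 8601 with a numeric UTC offset
]


def parse_timestamp(value, fmt=None):
    """Parse with an explicit format if given, else try each common format."""
    formats = [fmt] if fmt else COMMON_FORMATS
    for candidate in formats:
        try:
            return datetime.strptime(value, candidate)
        except ValueError:
            continue  # this format did not match; try the next one
    return None


print(parse_timestamp("2019-06-19 17:20:49"))
print(parse_timestamp("19/06/2019:17:20:49 -0400", "%d/%m/%Y:%H:%M:%S %z"))
```

The second call shows why an explicit format is sometimes needed: the day-first timestamp from the Examples section does not match any format in this sketch's fallback list.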