Structuring, Shaping, and Transforming Data

Use Vector to parse, structure, shape, and transform observability data

Before you begin, this guide assumes the following:

Vector provides multiple transforms that you can use to modify your observability data as it passes through your Vector topology.

The transform that you will likely use most often is the remap transform, which uses a single-purpose data transformation language called Vector Remap Language (VRL for short) to define event transformation logic. VRL has several features that should make it your first choice for transforming data in Vector:

  • It offers a wide range of observability-data-specific functions that map directly to observability use cases.
  • It’s built for the very specific use case of working with Vector logs and metrics, which means that it has no extraneous functionality, its data model maps directly to Vector’s internal data model, and its performance is comparable to native Rust performance.
  • The VRL compiler built into Vector performs several compile-time checks to ensure that your VRL code is sound, meaning no dead code, no unhandled errors, and no type mismatches.

In cases where VRL doesn’t fit your use case, Vector also offers a Lua runtime transform that offers a bit more flexibility than VRL but also comes with downsides (listed below) that you should always bear in mind.

Transforming data using VRL

Let’s jump straight into an example of using VRL to modify some data. We’ll create a simple topology consisting of three components:

  1. A demo_logs source produces random Syslog messages at a rate of 10 per second.
  2. A remap transform uses VRL to parse incoming Syslog lines into named fields (severity, timestamp, etc.).
  3. A console sink pipes the output of the topology to stdout, so that we can see the results on the command line.

This configuration defines that topology:

sources:
  logs:
    type: demo_logs
    format: syslog
    interval: 0.1

transforms:
  modify:
    type: remap
    inputs:
      - logs
    source: |
      # Parse Syslog input. The "!" means that the script should abort on error.
      . = parse_syslog!(.message)      

sinks:
  out:
    type: console
    inputs:
      - modify
    encoding:
      codec: json

Although we’re using YAML for the configuration here, Vector also supports TOML and JSON.

To start Vector using this topology:

vector --config /etc/vector/vector.yaml

You should see lines like this emitted via stdout (formatted for readability here):

{
  "appname": "authsvc",
  "facility": "daemon",
  "hostname": "acmecorp.biz",
  "message": "#hugops to everyone who has to deal with this",
  "msgid": "ID486",
  "procid": 5265,
  "severity": "notice",
  "timestamp": "2021-01-19T18:16:40.027Z"
}

So far, we’ve gotten Vector to parse the Syslog data but we’re not yet modifying that data. So let’s update the source script of our remap transform to make some ad hoc transformations:

transforms:
  modify:
    type: remap
    inputs:
      - logs
    source: |
      . = parse_syslog!(.message)

      # Convert the timestamp to a Unix timestamp, aborting on error
      .timestamp = to_unix_timestamp!(.timestamp)

      # Remove the "facility" and "procid" fields
      del(.facility)
      del(.procid)

      # Replace the "msgid" field with a unique ID
      .msgid = uuid_v4()

      # If the log message contains the phrase "Great Scott!", set the new field
      # "critical" to true, otherwise set it to false. If the "contains" function
      # errors, log the error (instead of aborting the script, as above).
      if (is_critical, err = contains(.message, "Great Scott!"); err != null) {
        log(err, level: "error")
      }

      .critical = is_critical      

A few things to notice about this script:

  • Any errors that VRL functions can return must be handled. Were we to neglect to handle the potential error from the parse_syslog function, for example, the VRL compiler would report a very specific error and Vector wouldn’t start up.
  • VRL has language constructs like variables, if statements, comments, and logging.
  • The . acts as a sort of “container” for the event data. . by itself refers to the root event, while you can use paths like .foo, .foo[0], .foo.bar, .foo.bar[0], and so on to reference subfields, array indices, and more.
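
As a sketch of the first and third points, the following variant of the remap transform handles a parsing failure without aborting and then writes to nested paths. The transform name modify_lenient and the fields parse_failed, meta.pipeline, tags, and first_tag are hypothetical names chosen for illustration. This variant is an aside and is not wired into the topology used in this guide.

```yaml
transforms:
  modify_lenient:            # hypothetical name, not part of the topology above
    type: remap
    inputs:
      - logs
    source: |
      # Capture the error in "err" instead of aborting with "!".
      parsed, err = parse_syslog(.message)

      if err != null {
        # Parsing failed: keep the original event and flag it.
        log(err, level: "warn")
        .parse_failed = true
      } else {
        # Parsing succeeded: replace the root event with the parsed fields.
        . = parsed
      }

      # Paths can address nested fields and array indices.
      .meta.pipeline = "demo"
      .tags = ["syslog", "demo"]
      .first_tag = .tags[0]
```

Assigning the error to a variable is the same pattern the script above uses for contains. The difference from the "!" form is that a malformed line is passed along with a flag rather than stopping the script for that event.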

If you stop and restart Vector, you should see log lines like this (again reformatted for readability):

{
  "appname": "authsvc",
  "hostname": "acmecorp.biz",
  "message": "Great Scott! We're never gonna reach 88 mph with the flux capacitor in its current state!",
  "msgid": "4e4437b6-13e8-43b3-b51e-c37bd46de490",
  "severity": "notice",
  "timestamp": 1611080200,
  "critical": true
}
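
One way to check a script like this without watching live output is a Vector unit test. The following is a minimal sketch, assuming the tests configuration block and the vector test subcommand available in recent Vector versions. The test name and the sample Syslog line are made up for illustration, modeled on the output shown earlier.

```yaml
tests:
  - name: modify_parses_and_reshapes_syslog   # hypothetical test name
    inputs:
      - insert_at: modify
        type: log
        log_fields:
          # Made-up RFC 5424 line modeled on the sample output in this guide
          message: "<29>1 2021-01-19T18:16:40.027Z acmecorp.biz authsvc 5265 ID486 - #hugops to everyone who has to deal with this"
    outputs:
      - extract_from: modify
        conditions:
          - type: vrl
            source: |
              # Parsed fields are present, removed fields are gone
              assert_eq!(.appname, "authsvc")
              assert!(!exists(.facility))
              assert!(!exists(.procid))
              # The sample message does not contain "Great Scott!"
              assert_eq!(.critical, false)
```

If this block is added to the same file as the topology, running vector test with that file's path (for example, vector test /etc/vector/vector.yaml) should execute it.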

And that’s it! We’ve successfully created a Vector topology that transforms every event that passes through it. If you’d like to know more about VRL, we recommend checking out the VRL documentation.

Lua runtime transform

If VRL doesn’t cover your use case (which should happen rarely), Vector also offers a lua runtime transform that you can use instead. It enables you to run Lua code that you include directly in your Vector configuration.

The lua transform provides maximal flexibility because it enables you to use a full-fledged programming language right inside of Vector. But we recommend using it only when truly necessary, for several reasons:

  1. The lua transform makes it all too easy to write scripts that are slow, error prone, and hard to read.
  2. It requires you to add a coding/testing/debugging workflow to using Vector, which is worth the effort if there’s no other way to satisfy your use case but best avoided if possible.
  3. It imposes a performance penalty vis-à-vis VRL.
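
For comparison, here is a minimal sketch of the "critical" check from the VRL example, written for the lua transform. It assumes version 2 of the transform, in which the hooks.process function receives the event and an emit function. The transform name lua_modify is hypothetical.

```yaml
transforms:
  lua_modify:                # hypothetical name
    type: lua
    version: "2"
    inputs:
      - logs
    hooks:
      process: |-
        function (event, emit)
          local message = event.log.message
          if type(message) == "string" then
            -- Plain-text search (fourth argument disables Lua patterns)
            event.log.critical = string.find(message, "Great Scott!", 1, true) ~= nil
          else
            -- No usable message field: treat the event as not critical
            event.log.critical = false
          end
          -- Events are only passed downstream if they are emitted explicitly
          emit(event)
        end
```

Even in this small case, the script has to check the field type and emit the event by hand, and mistakes in it surface only at runtime rather than when Vector starts.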