Structuring Your Log Data

How to parse log data in Vector
type: tutorial
domain: config

Structured logs are like cocktails; they're cool because they're complicated. In this guide we'll build a pipeline of transforms that takes unstructured events like this:

172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] "PUT /mesh" 406 10272

And sends them out the other end in a structured format like this:

{
  "bytes_in": "656",
  "timestamp": "2019-05-03T13:11:48-04:00",
  "method": "PUT",
  "bytes_out": "10272",
  "host": "172.128.80.109",
  "status": "406",
  "user": "Bins5273",
  "path": "/mesh"
}

Tutorial

  1. Set up a basic pipeline

    In the last guide we simply piped stdin to stdout. I'm not trying to diminish your sense of achievement, but that was pretty basic.

    This time we're going to build a config we might use in the real world. It's going to consume logs over TCP with a socket source and write them to an elasticsearch sink.

    The basic source-to-sink version of our pipeline looks like this:

    vector.toml
    [sources.foo]
    type = "socket"
    address = "0.0.0.0:9000"
    mode = "tcp"
    [sinks.bar]
    inputs = ["foo"]
    type = "elasticsearch"
    index = "example-index"
    host = "http://10.24.32.122:9000"

    If we were to run it, the raw data we consume over TCP would be captured in the message field, and the object we'd publish to Elasticsearch would look like this:

    log event
    {"message":"172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] \"PUT /mesh\" 406 10272","host":"foo","timestamp":"2019-05-03T13:11:48-04:00"}

    That's hardly structured at all! Let's remedy that by adding our first transform.

  2. Add a parsing transform

    Nothing in this world is ever good enough for you; why should events be any different?

    Vector makes it easy to mutate events into a more (or less) structured format with transforms. Let's parse our logs into a structured format by capturing named regular expression groups with a regex_parser transform.

    A config can have any number of transforms, and it's entirely up to you how they are chained together. As with sinks, a transform requires you to specify where its data comes from. When a sink is configured to accept data from a transform, the pipeline is complete.

    Let's place our new transform in between our existing source and sink:

    vector.toml
    [sources.foo]
    type = "socket"
    address = "0.0.0.0:9000"
    mode = "tcp"
    +[transforms.apache_parser]
    + inputs = ["foo"]
    + type = "regex_parser"
    + field = "message"
    + regex = '^(?P<host>[\w\.]+) - (?P<user>[\w]+) (?P<bytes_in>[\d]+) \[(?P<timestamp>.*)\] "(?P<mathod>[\w]+) (?P<path>.*)" (?P<status>[\d]+) (?P<bytes_out>[\d]+)$'
    +
    [sinks.bar]
    - inputs = ["foo"]
    + inputs = ["apache_parser"]
    type = "elasticsearch"
    index = "example-index"
    host = "http://10.24.32.122:9000"

    This regular expression looks great and it probably works, but it's best to be sure, right? That leads us to the next step.

  3. Test it

    No one is saying that unplanned explosions aren't cool, but you should be doing that in your own time. In order to test our transform we could set up a local Elasticsearch instance and run the whole pipeline, but that's an awful bother and Vector has a much better way.

    Instead, we can write unit tests as part of our config, just as we would for regular code:

    vector.toml
    # Write the data
    [sinks.bar]
    inputs = ["apache_parser"]
    type = "elasticsearch"
    index = "example-index"
    host = "http://10.24.32.122:9000"
    +
    +[[tests]]
    + name = "test apache regex"
    +
    + [[tests.inputs]]
    + insert_at = "apache_parser"
    + type = "raw"
    + value = "172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] \"PUT /mesh\" 406 10272"
    +
    + [[tests.outputs]]
    + extract_from = "apache_parser"
    + [[tests.outputs.conditions]]
    + type = "check_fields"
    + "method.eq" = "PUT"
    + "host.eq" = "172.128.80.109"
    + "timestamp.eq" = "2019-05-03T13:11:48-04:00"
    + "path.eq" = "/mesh"
    + "status.eq" = "406"

    This unit test spec has a name, defines an input event to feed into our pipeline at a specific transform (in this case our only transform), and defines where we'd like to capture resulting events, along with a condition to check them against.
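
    If you wanted the test to pin down every field from our sample line, you could extend the check_fields condition with more predicates. The extra values here are simply taken from the event above, so treat this as a sketch of what a fuller condition might look like:

    [[tests.outputs.conditions]]
    type = "check_fields"
    "method.eq" = "PUT"
    "host.eq" = "172.128.80.109"
    "timestamp.eq" = "2019-05-03T13:11:48-04:00"
    "path.eq" = "/mesh"
    "status.eq" = "406"
    "user.eq" = "Bins5273"
    "bytes_in.eq" = "656"
    "bytes_out.eq" = "10272"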

    When we run:

    vector test ./vector.toml

    It will parse and execute our test:

    Running vector.toml tests
    test vector.toml: test apache regex ... failed
    failures:
    --- vector.toml ---
    test 'test apache regex':
    check transform 'apache_parser' failed conditions:
    condition[0]: predicates failed: [ method.eq: "PUT" ]
    payloads (events encoded as JSON):
    input: {"timestamp":"2020-02-20T10:19:27.283745Z","message":"172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] \"PUT /mesh\" 406 10272"}
    output: {"bytes_in":"656","timestamp":"2019-05-03T13:11:48-04:00","mathod":"PUT","bytes_out":"10272","host":"172.128.80.109","status":"406","user":"Bins5273","path":"/mesh"}

    By Jove! There was a problem with our regular expression! Our test has pointed out that the predicate method.eq failed, and has helpfully printed our input and resulting events in JSON format.

    This allows us to inspect exactly what our transform is doing, and it turns out that the method from our Apache log is actually being captured in a field named mathod.

    See if you can spot the typo. Once it's fixed, we can run vector test ./vector.toml again and we should get this:

    Running vector.toml tests
    test vector.toml: test apache regex ... passed

    Success! If the typo eluded you, the corrected regex is shown below. Next, try experimenting by adding more transforms to your pipeline before moving on to the next guide; there's a small sketch below to get you started.
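
    For reference, the named capture group for the request method was spelled mathod; the corrected regex line in the apache_parser transform is:

    regex = '^(?P<host>[\w\.]+) - (?P<user>[\w]+) (?P<bytes_in>[\d]+) \[(?P<timestamp>.*)\] "(?P<method>[\w]+) (?P<path>.*)" (?P<status>[\d]+) (?P<bytes_out>[\d]+)$'

    As one idea for further experimentation, here's a minimal sketch that tags every parsed event with a static field before it reaches the sink. It assumes the add_fields transform accepts a fields table as described in Vector's transform reference, and the transform name and field value are placeholders:

    [transforms.tag_environment]
    inputs = ["apache_parser"]
    type = "add_fields"
    fields.environment = "production"

    If you add a transform like this, remember to point the sink's inputs at the last transform in the chain (here, "tag_environment") so the pipeline stays connected.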

Next Steps

Now that you're a Vector pro you'll have endless ragtag groups of misfits trying to recruit you as their hacker, but it won't mean much if you can't deploy Vector. On to the next guide!