Getting Started Guide

Vector is a simple beast to tame. In this guide we'll build a pipeline with some common transformations, send an event through it, and touch on some basic concepts.

1. Install Vector

If you haven't already, install Vector. Here's a script for the lazy:

curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | sh

Or choose your preferred installation method.
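For example, if you'd rather run Vector in a container, Docker images are published as well. A hedged sketch (the timberio/vector image name and tag are assumptions; check the installation docs for the current ones):

# Assumption: images are published under timberio/vector
docker pull timberio/vector:latest-alpine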

2. Configure it

Vector runs with a configuration file that tells it which components to run and how they should interact. Let's create one that reads unstructured Apache logs over TCP using a socket source and writes them to an Elasticsearch sink, all without having to set up a local Elasticsearch cluster:

vector.toml
# Consume data
[sources.foo]
type = "socket"
address = "0.0.0.0:9000"
mode = "tcp"
# Write the data
[sinks.bar]
inputs = ["foo"]
type = "elasticsearch"
index = "example-index"
host = "http://10.24.32.122:9000"

Every component within a Vector config has an identifier chosen by you. This allows you to specify where a sink should gather its data from (using the inputs field).
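Those identifiers also make it easy to fan the same data out to several places. As a sketch, a second sink reading from the same source might look like this (console is a real Vector sink type, though its exact options can vary between versions):

[sinks.debug]
# The same source can feed any number of sinks
inputs = ["foo"]
type = "console"
encoding = "json"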

That's it for our first config. If we were to run it, the raw data we consume over TCP would be captured in the field message, and the object we'd publish to Elasticsearch would look like this:

{"message":"172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] \"PUT /mesh\" 406 10272","host":"foo","timestamp":"2019-05-03T13:11:48-04:00"}

It would be much better if we could parse out the contents of the Apache logs into structured fields.

3. Transform events

Nothing in this world is ever good enough for you; why should events be any different?

Vector makes it easy to mutate events into a more (or less) structured format with transforms. Let's parse our logs into a structured format by capturing named regular expression groups with a regex_parser transform.

A config can have any number of transforms, and it's entirely up to you how they are chained together. Similar to sinks, a transform requires you to specify where its data comes from. When a sink is configured to accept data from a transform, the pipeline is complete.
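As a sketch of what a longer chain looks like (the component names here are made up, and while add_fields and console are real Vector component types, double-check their options against your version's reference docs):

[transforms.step_one]
inputs = ["foo"]          # reads from the source
type = "add_fields"
  [transforms.step_one.fields]
  environment = "demo"

[transforms.step_two]
inputs = ["step_one"]     # reads from the previous transform
type = "add_fields"
  [transforms.step_two.fields]
  team = "ops"

[sinks.out]
inputs = ["step_two"]     # the sink completes the pipeline
type = "console"
encoding = "json"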

Let's place our new transform between our existing source and sink:

vector.toml
# Consume data
[sources.foo]
type = "socket"
address = "0.0.0.0:9000"
mode = "tcp"
+# Structure the data
+[transforms.apache_parser]
+ inputs = ["foo"]
+ type = "regex_parser"
+ field = "message"
+ regex = '^(?P<host>[\w\.]+) - (?P<user>[\w]+) (?P<bytes_in>[\d]+) \[(?P<timestamp>.*)\] "(?P<mathod>[\w]+) (?P<path>.*)" (?P<status>[\d]+) (?P<bytes_out>[\d]+)$'
+
# Write the data
[sinks.bar]
- inputs = ["foo"]
+ inputs = ["apache_parser"]
type = "elasticsearch"
index = "example-index"
host = "http://10.24.32.122:9000"

This regular expression looks great and it probably works, but it's best to be sure, right?

4. Test it

No one is saying that unplanned explosions aren't cool, but you should be doing that on your own time. In order to test our transform we could set up a local Elasticsearch instance and run the whole pipeline, but that's an awful bother, and Vector has a much better way.

Instead, we can write unit tests as part of our config, just as we would for regular code:

vector.toml
# Write the data
[sinks.bar]
inputs = ["apache_parser"]
type = "elasticsearch"
index = "example-index"
host = "http://10.24.32.122:9000"
+
+[[tests]]
+ name = "test apache regex"
+
+ [[tests.inputs]]
+ insert_at = "apache_parser"
+ type = "raw"
+ value = "172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] \"PUT /mesh\" 406 10272"
+
+ [[tests.outputs]]
+ extract_from = "apache_parser"
+ [[tests.outputs.conditions]]
+ type = "check_fields"
+ "method.eq" = "PUT"
+ "host.eq" = "172.128.80.109"
+ "timestamp.eq" = "2019-05-03T13:11:48-04:00"
+ "path.eq" = "/mesh"
+ "status.eq" = "406"

This unit test spec has a name, defines an input event to feed into our pipeline at a specific transform (in this case our only transform), and defines where we'd like to capture the resulting events, along with a condition to check them against.

When we run vector test ./vector.toml, Vector will parse and execute our test:

Running vector.toml tests
test vector.toml: test apache regex ... failed
failures:
--- vector.toml ---
test 'test apache regex':
check transform 'apache_parser' failed conditions:
condition[0]: predicates failed: [ method.eq: "PUT" ]
payloads (events encoded as JSON):
input: {"timestamp":"2020-02-20T10:19:27.283745Z","message":"172.128.80.109 - Bins5273 656 [2019-05-03T13:11:48-04:00] \"PUT /mesh\" 406 10272"}
output: {"bytes_in":"656","timestamp":"2019-05-03T13:11:48-04:00","mathod":"PUT","bytes_out":"10272","host":"172.128.80.109","status":"406","user":"Bins5273","path":"/mesh"}

By Jove! There was a problem with our regular expression! Our test has pointed out that the predicate method.eq failed, and has helpfully printed our input and resulting events in JSON format.

This allows us to inspect exactly what our transform is doing, and it turns out that the method from our Apache log is actually being captured in a field mathod.

See if you can spot the typo. Once it's fixed, we can run vector test ./vector.toml again, and we should get this:

Running vector.toml tests
test vector.toml: test apache regex ... passed
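If the typo eluded you: the capture group for the request method was spelled mathod. Renaming it to method makes the captured field match the predicate in our test:

regex = '^(?P<host>[\w\.]+) - (?P<user>[\w]+) (?P<bytes_in>[\d]+) \[(?P<timestamp>.*)\] "(?P<method>[\w]+) (?P<path>.*)" (?P<status>[\d]+) (?P<bytes_out>[\d]+)$'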

Success! Next, try experimenting by adding more transforms to your pipeline before moving on to the next guide.
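For instance, you could try coercing the numeric fields into integers. Vector ships a coercer transform for exactly this; the options below are a hedged sketch, so confirm them against the reference docs for your version:

[transforms.coerce_types]
inputs = ["apache_parser"]
type = "coercer"
  [transforms.coerce_types.types]
  # Parse these string fields into integers
  status = "int"
  bytes_in = "int"
  bytes_out = "int"

Remember to point the sink's inputs at coerce_types so it sits at the end of the chain.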

Good luck! Now that you're a Vector pro, you'll have endless ragtag groups of misfits trying to recruit you as their hacker.