Enrich your observability data

Learn how to use CSV enrichment to provide more context to your data

type: guide domain: enriching

Before you begin, this guide assumes the following:

It’s important for any organization maintaining observability pipelines to have flexibility to manipulate their observability data — whether it be appending better context to an event or triggering alerts when there is a potential threat. A key component that drives that flexibility is enriching your data from different sources. Vector now offers initial support for enriching events from external data sources, which is currently powered by a powerful Vector concept, enrichment tables. For now, our support for enrichment through external data sources is limited to csv files, but we’re looking to add support for more data sources.

This guide walks through how you can start enriching your observability by exploring two interesting use cases.

Use Case 1 - IoT Devices

When working with IoT devices, you often want to minimize payloads being emmitted by the devices, even when collecting events from those devices. This is where Enrichment tables comes in very handy. Let’s assume that you have an IoT device that emits its status - online, offline, transmitting, and error states, but the device needs to emit as little payload as possible due to technical constraints or requirements; instead, the IoT can device can emit integers — 1 for online, 2 for offline, etc.

To enrich the IoT device’s observability data, let’s use a csv file containing the following information:

status_code,status_message
1,"device status online"
2,"device status offline"
3,"device status connection error"
...
7,"device status transmitting"
8,"device status transmission complete"
9,"device status not responding"

To enrich your observability data, you can use ‘get_enrichment_table_record’. This function searches your enrichment table for a row that matches the original field value, replacing that with a new value in that row. Assuming that your csv file is called iot_status.csv, the following illustrates the required Vector configuration:

[enrichment_tables.iot_status]
type = "file"

[enrichment_tables.iot_status.file]
path = "/etc/vector/iot_status.csv"
encoding = { type = "csv" }

[enrichment_tables.iot_status.schema]
status_code = "integer"
status_message = "string"

After this configuration, we can now translate the output from IoT devices to human-readable messages that provide further context in our iot_status.csv. To do so, we can make use of the get_enrichment_table_record function.

[transforms.enrich_iot_status]
type = "remap"
inputs = ["datadog_agents"]
source = '''
. = parse_json!(.status_message)

status_code = del(.status_code)

# In the case that no row with a matching value is found, the original value of
# the status code is assigned.
row = get_enrichment_table_record("iot_status", {"status_code" : status_code}) ?? status_code

.status = row.status_message
'''

Your observability data, assuming it was in JSON, now has been transformed from:

{
  "host":"my.host.com",
  "timestamp":"2019-11-01T21:15:47+00:00",
  ...
  "status_code":1,
}

To:

{
  "host":"my.host.com",
  "timestamp":"2019-11-01T21:15:47+00:00",
  ...
  "status":"device status transmission complete",
}

While the previous example is relatively straightforward, the second example use case really shows how powerful event enrichment can be for your observability pipeline.

Use Case 2 - Setting Alerts for Suspicious Access

In this example, you are dealing with a system where attempted access from specific identifiers, such as a blacklisted IP address, must automatically trigger alerts to relevant personnel on your team. You can use an external database (though Vector’s current solution is limited to csv files) that contain a list of blacklisted IP address and enrich the data so that whatever downstream log management solution your team may be using, such as [Datadog’s Log Management product], can trigger the alert based on the scrubbed log.

The key benefit to using enrichment tables in this case is that you can enrich your observability data to trigger an alert from a data source. In cases where you want to avoid exposing this data source due to its sensitivity, you can do this entirely on your on-premise infrastructure, rather than uploading the dataset to a 3rd-party log management solution.

Let’s assume that you have a csv file containing a list of sensitive IP addresses that’s deemed suspicious for your company for one reason or another whether it be from your own proprietary list or a 3rd-party source, such as Emerging Threats or FBI InfraGard. Any IP address that’s on the list that attempts to access specific URL on your service must trigger an alert to your team for further investigation. In that case, you can set your csv file, let’s call it similarly to below:

ip,alert_type,severity
"192.0.2.0", "alert", "high"
"198.51.100.0", "alert", "medium"
...
"203.0.113.0", "warn", "medium"

Assuming you set the enrichment_tables similarly to the configuration in Use Case 1 example, the following configuration is all that’s necessary:

[transforms.ip_alert]
type = "remap"
inputs = ["datadog_agent"]
source = '''

. = parse_json!(.message)

ip = del(.ip)

row = get_enrichment_table_record("ip_info", { "ip" : ip }) ?? ip

.alert.type = row.alert_type
.alert.severity = row.severity

'''

Your observability data, assuming it was in JSON, now has been transformed from:

{
  "host":"my.host.com",
  "timestamp":"2019-11-01T21:15:47+00:00",
  ...
  "ip":"192.0.2.0",
}

To:

{
  "host":"my.host.com",
  "timestamp":"2019-11-01T21:15:47+00:00",
  ...
  "alert": {
    "type":"alert",
    "severity":"medium"
  }
}

These examples are intended to serve as a basic guide on how you can start leveraging enrichment tables. If you are or decide to use enrichment tables for other use cases, let us know on our Discord chat or Twitter, along with any feedback or request for additional support you have for the Vector team!