File Source

The Vector file source collects logs from files.

Configuration

[sources.my_source_id]
type = "file" # required
ignore_older = 600 # optional, no default, seconds
include = ["/var/log/**/*.log"] # required
  • optional, string

    data_dir

    The directory used to persist file checkpoint positions. By default, the global data_dir option is used. Make sure the Vector process has write permissions to this directory. See Checkpointing for more info.
  • optional, [string]

    exclude

    Array of file patterns to exclude. Globbing is supported. Takes precedence over the include option.
  • optional, string

    file_key

    The key name added to each event with the full path of the file.

    • Default: "file"
  • optional, table

    fingerprint

    Configuration for how the file source should identify files.

    • optional, uint

      bytes

      The number of bytes read off the head of the file to generate a unique fingerprint.

      • Only relevant when: strategy = "checksum"
      • Default: 256 (bytes)
    • optional, uint

      ignored_header_bytes

      The number of bytes to skip ahead (or ignore) when generating a unique fingerprint. This is helpful if all files share a common header. See Identification for more info.

      • Only relevant when: strategy = "checksum"
      • Default: 0 (bytes)
    • enum, optional, string

      strategy

      The strategy used to uniquely identify files. This is important for checkpointing when file rotation is used.

      • Default: "checksum"
      • Enum, must be one of: "checksum", "device_and_inode"
  • optional, uint

    glob_minimum_cooldown

    Delay between file discovery calls. This controls the interval at which Vector searches for files. See Autodiscovery and Globbing for more info.

    • Default: 1000 (milliseconds)
  • optional, string

    host_key

    The key name added to each event representing the current host. This can also be set via the global host_key option.

    • Default: "host"
  • common, optional, uint

    ignore_older

    Ignore files whose data modification time is older than this age, in seconds.
  • common, required, [string]

    include

    Array of file patterns to include. Globbing is supported. See File Read Order and File Rotation for more info.
  • optional, uint

    max_line_bytes

    The maximum number of bytes a line can contain before being discarded. This protects against malformed lines or tailing incorrect files.

    • Default: 102400 (bytes)
  • optional, uint

    max_read_bytes

    An approximate limit on the amount of data read from a single file at a given time.
  • optional, table

    multiline

    Multiline parsing configuration. If not specified, multiline parsing is disabled. See Multiline Messages for more info.

    • common, required, string

      condition_pattern

      Condition regex pattern to look for. Exact behavior is configured via mode. See Multiline Messages for more info.
    • enum, common, required, string

      mode

      Mode of operation, specifies how condition_pattern is interpreted. See Multiline Messages for more info.

      • Enum, must be one of: "continue_through", "continue_past", "halt_before", "halt_with"
    • common, required, string

      start_pattern

      Start regex pattern to look for as the beginning of the message. See Multiline Messages for more info.
    • common, required, uint

      timeout_ms

      The maximum time to wait for the continuation, in milliseconds. Once this timeout is reached, the buffered message is guaranteed to be flushed, even if incomplete.
  • optional, bool

    oldest_first

    Instead of balancing read capacity fairly across all watched files, prioritize draining the oldest files before moving on to read data from younger files. See File Read Order for more info.

    • Default: false
  • optional, uint

    remove_after

    Timeout, after reaching EOF, after which the file will be removed from the filesystem, unless new data is written to it in the meantime. If not specified, files are never removed.

    • WARNING: Vector's process must have permission to delete files.
  • optional, bool

    start_at_beginning

    For files with a stored checkpoint at startup, setting this option to true will tell Vector to read from the beginning of the file instead of the stored checkpoint. See Read Position for more info.

    • Default: false

Output

This component outputs log events with the following fields:

{
  "file": "/var/log/apache/access.log",
  "host": "my-host.local",
  "message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
  "timestamp": "2020-10-10T17:07:36+00:00"
}
  • common, required, string

    file

    The absolute path of the originating file. See Context for more info.
  • common, required, string

    host

    The local hostname, equivalent to the gethostname command.
  • common, required, string

    message

    The raw line from the file.
  • common, required, timestamp

    timestamp

    The exact time the event was ingested into Vector.

Telemetry

This component provides the following metrics that can be retrieved through the internal_metrics source. See the metrics section in the monitoring page for more info.
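As a sketch, these metrics can be collected by adding an internal_metrics source to the same Vector instance (the source ID below is illustrative):

```toml
# Illustrative: collect Vector's own telemetry, including the
# file-source metrics listed below.
[sources.vector_internal]
type = "internal_metrics"
```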

  • counter

    checkpoint_write_errors_total

    The total number of errors writing checkpoints. This metric includes the following tags:

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    checkpoints_total

    The total number of files checkpointed. This metric includes the following tags:

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    checksum_errors_total

    The total number of errors identifying files via checksum. This metric includes the following tags:

    • file - The file that produced the error.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    file_delete_errors_total

    The total number of failures to delete a file. This metric includes the following tags:

    • file - The file that produced the error.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    file_watch_errors_total

    The total number of errors encountered when watching files. This metric includes the following tags:

    • file - The file that produced the error.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    files_added_total

    The total number of files Vector has found to watch. This metric includes the following tags:

    • file - The file that was added.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    files_deleted_total

    The total number of files deleted. This metric includes the following tags:

    • file - The file that was deleted.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    files_resumed_total

    The total number of times Vector has resumed watching a file. This metric includes the following tags:

    • file - The file that was resumed.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    files_unwatched_total

    The total number of times Vector has stopped watching a file. This metric includes the following tags:

    • file - The file that was unwatched.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    processed_events_total

    The total number of events processed by this component. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • file - The file the events were read from.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    fingerprint_read_errors_total

    The total number of times Vector failed to read a file for fingerprinting. This metric includes the following tags:

    • file - The file that produced the error.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    processed_bytes_total

    The total number of bytes processed by the component. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

Examples

Given the following input:

53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308

And the following configuration:

[sources.file]
type = "file"
include = ["/var/log/**/*.log"]

The following Vector log event will be output:

{
  "file": "/var/log/apache/access.log",
  "host": "my-host.local",
  "message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
  "timestamp": "2020-10-10T17:07:36.452332Z"
}

How It Works

Autodiscovery

Vector will continually look for new files matching any of your include patterns. The frequency is controlled via the glob_minimum_cooldown option. If a new file is added that matches any of the supplied patterns, Vector will begin tailing it. Vector maintains a unique list of files and will not tail a file more than once, even if it matches multiple patterns. You can read more about how we identify files in the Identification section.
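For example, a sketch of tuning the discovery interval (the 10-second value is illustrative, not a recommendation):

```toml
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
# Check for new files matching the include patterns every 10 seconds.
glob_minimum_cooldown = 10000 # milliseconds
```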

Checkpointing

Vector checkpoints the current read position after each successful read. This ensures that Vector resumes where it left off if restarted, preventing data from being read twice. The checkpoint positions are stored in the data directory which is specified via the global data_dir option, but can be overridden via the data_dir option in the file source directly.
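For instance, a sketch of overriding the checkpoint directory for this source alone (the path is illustrative; Vector must be able to write to it):

```toml
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
# Store this source's checkpoints separately from the global data_dir.
data_dir = "/var/lib/vector/file-checkpoints"
```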

Compressed Files

Vector will transparently detect files which have been compressed using Gzip and decompress them for reading. This detection process looks for the unique sequence of bytes in the Gzip header and does not rely on the compressed files adhering to any kind of naming convention.

One caveat with reading compressed files is that Vector is not able to efficiently seek into them. Rather than implement a potentially expensive full scan as a seek mechanism, Vector currently will not attempt to make further reads from a file for which it has already stored a checkpoint in a previous run. For this reason, users should take care to allow Vector to fully process any compressed files before shutting the process down or moving the files to another location on disk.
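The header check described above can be sketched as follows. This is a simplification, not Vector's actual implementation: gzip streams begin with the fixed magic bytes 0x1f 0x8b, so the format can be detected without relying on file names.

```python
import gzip

def looks_gzipped(first_bytes: bytes) -> bool:
    # Gzip members start with the magic bytes 0x1f 0x8b, followed by a
    # compression-method byte (0x08 for DEFLATE); checking the first two
    # bytes is enough to detect the format regardless of file name.
    return first_bytes[:2] == b"\x1f\x8b"

compressed = gzip.compress(b"a log line\n")
looks_gzipped(compressed)       # True
looks_gzipped(b"a log line\n")  # False
```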

Context

By default, the file source will augment events with helpful context keys as shown in the "Output" section.

File Deletion

When a watched file is deleted, Vector will maintain its open file handle and continue reading until it reaches EOF. When a file is no longer findable via the include option and the reader has reached EOF, that file's reader is discarded.

File Read Order

By default, Vector attempts to allocate its read bandwidth fairly across all of the files it's currently watching. This prevents a single very busy file from starving other independent files from being read. In certain situations, however, this can lead to interleaved reads from files that should be read one after the other.

For example, consider a service that logs to a timestamped file, creating a new one at a regular interval and leaving the old one as-is. Under normal operation, Vector would follow writes as they happen to each file and there would be no interleaving. In an overload situation, however, Vector may pick up and begin tailing newer files before catching up to the latest writes from older files. This would cause writes from a single logical log stream to be interleaved in time and potentially slow down ingestion as a whole, since the fixed total read bandwidth is allocated across an increasing number of files.

To address this type of situation, Vector provides the oldest_first option. When set, Vector will not read from any file younger than the oldest file that it hasn't yet caught up to. In other words, Vector will continue reading from older files as long as there is more data to read. Only once it hits the end will it then move on to read from younger files.

Whether or not to use the oldest_first flag depends on the organization of the logs you're configuring Vector to tail. If your include option contains multiple independent logical log streams (e.g. Nginx's access.log and error.log, or logs from multiple services), you are likely better off with the default behavior. If you're dealing with a single logical log stream or if you value per-stream ordering over fairness across streams, consider setting the oldest_first option to true.
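A sketch of opting into per-stream ordering (the source ID and paths are illustrative):

```toml
[sources.app_logs]
type = "file"
include = ["/var/log/app/app-*.log"]
# Drain older files fully before reading from newer ones.
oldest_first = true
```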

File Rotation

Vector supports tailing across a number of file rotation strategies. The default behavior of logrotate is simply to move the old log file and create a new one. This requires no special configuration of Vector, as it will maintain its open file handle to the rotated log until it has finished reading and it will find the newly created file normally.

A popular alternative strategy is copytruncate, in which logrotate will copy the old log file to a new location before truncating the original. Vector will also handle this well out of the box, but there are a couple configuration options that will help reduce the very small chance of missed data in some edge cases. We recommend a combination of delaycompress (if applicable) on the logrotate side and including the first rotated file in Vector's include option. This allows Vector to find the file after rotation, read it uncompressed to identify it, and then ensure it has all of the data, including any written in a gap between Vector's last read and the actual rotation event.
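As a sketch, pairing logrotate's delaycompress (so the most recent rotated file stays uncompressed) with an include pattern that also matches that rotated file might look like this (paths are illustrative):

```toml
[sources.app_log]
type = "file"
# Watch the live file and the first rotated copy, so data written in
# the window around a copytruncate rotation is still picked up.
include = ["/var/log/app.log", "/var/log/app.log.1"]
```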

Globbing

Globbing is supported in all provided file paths. Files are autodiscovered continually at a rate defined by the glob_minimum_cooldown option.

Line Delimiters

Each line is read until a new line delimiter (the 0xA byte) or EOF is found.

Multiline Messages

Sometimes a single log event will appear as multiple log lines. To handle this, Vector provides a set of multiline options. These options were carefully thought through and will allow you to solve the simplest and most complex cases. Let's look at a few examples:

Example 1: Ruby Exceptions

Ruby exceptions, when logged, consist of multiple lines:

foobar.rb:6:in `/': divided by 0 (ZeroDivisionError)
from foobar.rb:6:in `bar'
from foobar.rb:2:in `foo'
from foobar.rb:9:in `<main>'

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '^[^\s]'
mode = "continue_through"
condition_pattern = '^[\s]+from'
timeout_ms = 1000
  • start_pattern, set to ^[^\s], tells Vector that new multi-line events should not start with white-space.
  • mode, set to continue_through, tells Vector to continue aggregating lines until the condition_pattern is no longer valid (excluding the invalid line).
  • condition_pattern, set to ^[\s]+from, tells Vector to continue aggregating lines if they start with white-space followed by from.
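To make the continue_through behavior concrete, here is a hypothetical sketch of the aggregation loop (not Vector's actual code; it ignores start_pattern and timeout_ms for brevity): lines matching condition_pattern are appended to the current event, and the first non-matching line starts a new one.

```python
import re

def aggregate_continue_through(lines, condition_pattern):
    # Append lines to the current event while condition_pattern matches;
    # a non-matching line ends the event and begins the next one.
    cond = re.compile(condition_pattern)
    events, current = [], []
    for line in lines:
        if current and cond.search(line):
            current.append(line)
        else:
            if current:
                events.append("\n".join(current))
            current = [line]
    if current:
        events.append("\n".join(current))
    return events

lines = [
    "foobar.rb:6:in `/': divided by 0 (ZeroDivisionError)",
    "    from foobar.rb:6:in `bar'",
    "    from foobar.rb:2:in `foo'",
    "    from foobar.rb:9:in `<main>'",
    "a new, unrelated line",
]
events = aggregate_continue_through(lines, r'^[\s]+from')
# events[0] holds the whole exception; events[1] is the unrelated line
```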

Example 2: Line Continuations

Some programming languages use the backslash (\) character to signal that a line will continue on the next line:

First line\
second line\
third line

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '\\$'
mode = "continue_past"
condition_pattern = '\\$'
timeout_ms = 1000
  • start_pattern, set to \\$, tells Vector that new multi-line events start with lines that end in \.
  • mode, set to continue_past, tells Vector to continue aggregating lines, plus one additional line, until condition_pattern is false.
  • condition_pattern, set to \\$, tells Vector to continue aggregating lines if they end with a \ character.

Example 3: Timestamped Messages

Activity logs from services such as Elasticsearch typically begin with a timestamp, followed by information on the specific activity, as in this example:

[2015-08-24 11:49:14,389][ INFO ][env ] [Letha] using [1] data paths, mounts [[/
(/dev/disk1)]], net usable_space [34.5gb], net total_space [118.9gb], types [hfs]

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
mode = "halt_before"
condition_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
timeout_ms = 1000
  • start_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector that new multi-line events start with a timestamp sequence.
  • mode, set to halt_before, tells Vector to continue aggregating lines as long as the condition_pattern does not match.
  • condition_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector to continue aggregating up until a line starts with a timestamp sequence.

Read Position

By default, Vector will read new data only for newly discovered files, similar to the tail command. You can read from the beginning of the file by setting the start_at_beginning option to true.

Previously discovered files will be checkpointed, and the read position will resume from the last checkpoint. See Checkpointing for more info.
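A sketch of reading whole files from the start, which can be useful when ingesting historical logs:

```toml
[sources.my_source_id]
type = "file"
include = ["/var/log/**/*.log"]
# Read newly discovered files from the beginning instead of tailing.
start_at_beginning = true
```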

Identification

By default, Vector identifies files by computing a cyclic redundancy check (CRC) over the first 256 bytes of the file. This serves as a fingerprint that uniquely identifies the file. The number of bytes read can be controlled via the fingerprint.bytes and fingerprint.ignored_header_bytes options.

This strategy avoids the common pitfalls of identifying files by device and inode, since inodes can be reused across files. This enables Vector to properly tail files across various rotation strategies.
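The checksum strategy can be sketched roughly as follows. This is a hypothetical illustration, not Vector's implementation (Vector's exact CRC variant is not specified here, so zlib's CRC-32 stands in):

```python
import zlib

def fingerprint(path, n_bytes=256, ignored_header_bytes=0):
    # Skip any shared header, then CRC the next n_bytes from the head
    # of the file to produce a stable identity that survives renames.
    with open(path, "rb") as f:
        f.seek(ignored_header_bytes)
        head = f.read(n_bytes)
    return zlib.crc32(head)
```

Because the fingerprint depends only on the file's leading bytes, a rename during rotation does not change a file's identity, which is what lets checkpoints survive rotation.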