File source

Requirements

The Vector process must be able to read the files listed in include and to execute (search) each of their parent directories. See File permissions for more details.

Configuration

Example configurations

{
  "sources": {
    "my_source_id": {
      "type": "file",
      "include": [
        "/var/log/**/*.log"
      ]
    }
  }
}

[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]

sources:
  my_source_id:
    type: file
    include:
      - /var/log/**/*.log

{
  "sources": {
    "my_source_id": {
      "type": "file",
      "data_dir": "/var/local/lib/vector/",
      "exclude": [
        "/var/log/binary-file.log"
      ],
      "file_key": "file",
      "glob_minimum_cooldown_ms": 1000,
      "host_key": "hostname",
      "ignore_older_secs": 600,
      "include": [
        "/var/log/**/*.log"
      ],
      "line_delimiter": "\n",
      "max_line_bytes": 102400,
      "max_read_bytes": 2048,
      "offset_key": "offset",
      "read_from": "beginning",
      "rotate_wait_secs": 9223372036854776000
    }
  }
}

[sources.my_source_id]
type = "file"
data_dir = "/var/local/lib/vector/"
exclude = [ "/var/log/binary-file.log" ]
file_key = "file"
glob_minimum_cooldown_ms = 1_000
host_key = "hostname"
ignore_older_secs = 600
include = [ "/var/log/**/*.log" ]
line_delimiter = "\n"
max_line_bytes = 102_400
max_read_bytes = 2_048
offset_key = "offset"
read_from = "beginning"
rotate_wait_secs = 9_223_372_036_854_775_807

sources:
  my_source_id:
    type: file
    data_dir: /var/local/lib/vector/
    exclude:
      - /var/log/binary-file.log
    file_key: file
    glob_minimum_cooldown_ms: 1000
    host_key: hostname
    ignore_older_secs: 600
    include:
      - /var/log/**/*.log
    line_delimiter: "\n"
    max_line_bytes: 102400
    max_read_bytes: 2048
    offset_key: offset
    read_from: beginning
    rotate_wait_secs: 9223372036854776000

acknowledgements

optional object

Deprecated

This field is deprecated.

Controls how acknowledgements are handled by this source.

This setting is deprecated in favor of enabling acknowledgements at the global or sink level.

Enabling or disabling acknowledgements at the source level has no effect on acknowledgement behavior.

See End-to-end Acknowledgements for more information on how event acknowledgement is handled.

acknowledgements.enabled

optional bool

Whether or not end-to-end acknowledgements are enabled for this source.

data_dir

optional string literal

The directory used to persist file checkpoint positions.

By default, the global data_dir option is used. Make sure the running user has write permissions to this directory.

If this directory is specified, then Vector will attempt to create it.

Examples

"/var/local/lib/vector/"

encoding

optional object

Character set encoding.

encoding.charset

required string literal

Encoding of the source messages.

Takes one of the encoding label strings defined as part of the Encoding Standard.

When set, the messages are transcoded from the specified encoding to UTF-8, which is the encoding that is assumed internally for string-like data. Enable this transcoding operation if you need your data to be in UTF-8 for further processing. At the time of transcoding, any malformed sequences (that can’t be mapped to UTF-8) are replaced with the Unicode REPLACEMENT CHARACTER and warnings are logged.

Examples

"utf-16le"

"utf-16be"

exclude

optional [string]

Array of file patterns to exclude. Globbing is supported.

Takes precedence over the include option. Note: The exclude patterns are applied after the attempt to glob everything in include. This means that all files are first matched by include and then filtered by the exclude patterns. This can have an impact if include contains directories whose contents are not accessible.

Array string literal

Examples

[
  "/var/log/binary-file.log"
]

file_key

optional string literal

Overrides the name of the log field used to add the file path to each event.

The value is the full path to the file from which the event was read.

Set to "" to suppress this key.

Examples

"path"

default: file

fingerprint

optional object

Configuration for how files should be identified.

This is important for checkpointing when file rotation is used.

fingerprint.ignored_header_bytes

optional uint

The number of bytes to skip ahead (or ignore) when reading the data used for generating the checksum. If the file is compressed, the number of bytes refers to the header in the uncompressed content. Only gzip is supported at this time.

This can be helpful if all files share a common header that should be skipped.

Relevant when: strategy = "checksum"

fingerprint.lines

optional uint

The number of lines to read for generating the checksum.

The number of lines is determined from the uncompressed content if the file is compressed. Only gzip is supported at this time.

If the file has fewer than this number of lines, it won’t be read at all.

Relevant when: strategy = "checksum"

default: 1 (lines)

fingerprint.strategy

optional string literal enum

The strategy used to uniquely identify files.

This is important for checkpointing when file rotation is used.

Enum options

Option	Description
`checksum`	Read lines from the beginning of the file and compute a checksum over them.
`device_and_inode`	Use the device and inode as the identifier.

default: checksum

glob_minimum_cooldown_ms

optional uint

The delay between file discovery calls.

This controls the interval at which files are searched. A higher value results in greater chances of some short-lived files being missed between searches, but a lower value increases the performance impact of file discovery.

default: 1000 (milliseconds)

host_key

optional string literal

Overrides the name of the log field used to add the current hostname to each event.

By default, the global log_schema.host_key option is used.

Set to "" to suppress this key.

Examples

"hostname"

ignore_checkpoints

optional bool

Whether or not to ignore existing checkpoints when determining where to start reading a file.

Checkpoints are still written normally.

ignore_not_found

optional bool

Ignore missing files when fingerprinting.

This may be useful when used with source directories containing dangling symlinks.

default: false

ignore_older_secs

optional uint

Ignore files with a data modification date older than the specified number of seconds.

Examples

600

include

required [string]

Array of file patterns to include. Globbing is supported.

Array string literal

Examples

[
  "/var/log/**/*.log"
]

internal_metrics

optional object

Configuration of internal metrics for file-based components.

internal_metrics.include_file_tag

optional bool

Whether or not to include the “file” tag on the component’s corresponding internal metrics.

This is useful for distinguishing between different files while monitoring. However, the tag’s cardinality is unbounded.

default: false

line_delimiter

optional string literal

String sequence used to separate one file line from another.

Examples

"\r\n"

default: "\n"

max_line_bytes

optional uint

The maximum size of a line before it is discarded.

This protects against malformed lines or tailing incorrect files.

default: 102400 (bytes)

max_read_bytes

optional uint

Maximum number of bytes to read from a single file before switching over to the next file. Note: This does not apply when oldest_first is true.

This allows distributing the reads more or less evenly across the files.

default: 2048 (bytes)

multiline

optional object

Multiline aggregation configuration.

If not specified, multiline aggregation is disabled.

multiline.condition_pattern

required string literal

Regular expression pattern that is used to determine whether or not more lines should be read.

This setting must be configured in conjunction with mode.

Examples

"^[\\s]+"

"\\\\$"

"^(INFO|ERROR) "

";$"

multiline.mode

required string literal enum

Aggregation mode.

This setting must be configured in conjunction with condition_pattern.

Enum options

Option	Description
`continue_past`	All consecutive lines matching this pattern, plus one additional line, are included in the group. This is useful in cases where a log message ends with a continuation marker, such as a backslash, indicating that the following line is part of the same message.
`continue_through`	All consecutive lines matching this pattern are included in the group. The first line (the line that matched the start pattern) does not need to match the `continue_through` pattern. This is useful in cases such as a Java stack trace, where some indicator in the line (such as leading whitespace) indicates that it is an extension of the preceding line.
`halt_before`	All consecutive lines not matching this pattern are included in the group. This is useful where a log line contains a marker indicating that it begins a new message.
`halt_with`	All consecutive lines, up to and including the first line matching this pattern, are included in the group. This is useful where a log line ends with a termination marker, such as a semicolon.

Examples

"continue_past"

"continue_through"

"halt_before"

"halt_with"

multiline.start_pattern

required string literal

Regular expression pattern that is used to match the start of a new message.

Examples

"^[\\s]+"

"\\\\$"

"^(INFO|ERROR) "

";$"

multiline.timeout_ms

required uint

The maximum amount of time to wait for the next additional line, in milliseconds.

Once this timeout is reached, the buffered message is guaranteed to be flushed, even if incomplete.

Examples

1000

offset_key

optional string literal

Enables adding the file offset to each event and sets the name of the log field used.

The value is the byte offset of the start of the line within the file.

Off by default: the offset is only added to the event if this option is set.

Examples

"offset"

oldest_first

optional bool

Instead of balancing read capacity fairly across all watched files, prioritize draining the oldest files before moving on to read data from more recent files.

default: false

read_from

optional string literal enum

File position to use when reading a new file.

Enum options

Option	Description
`beginning`	Read from the beginning of the file.
`end`	Start reading from the current end of the file.

default: beginning

remove_after_secs

optional uint

After reaching EOF, the number of seconds to wait before removing the file, unless new data is written.

If not specified, files are not removed.

Warning

Vector’s process must have permission to delete files.

rotate_wait_secs

optional uint

How long to keep an open handle to a rotated log file. The default value represents “no limit”.

default: 9.223372036854776e+18 (seconds)

Outputs

<component_id>

Default output stream of the component. Use this component’s ID as an input to downstream transforms and sinks.

Output Types

Logs

Warning

The fields shown below will be different if log namespacing is enabled. See Log Namespacing for more details.

Line

An individual line from a file. Lines can be merged using the multiline options.

Fields

file required string literal

The absolute path of the originating file.

Examples

/var/log/apache/access.log

host required string literal

The local hostname, equivalent to the gethostname command.

Examples

my-host.local

message required string literal

The raw line from the file.

Examples

53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308

source_type required string literal

The name of the source type.

Examples

file

timestamp required timestamp

The exact time the event was ingested into Vector.

Examples

2020-10-10T17:07:36.452332Z

Telemetry

Metrics

checkpoints_total

counter

The total number of files checkpointed.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

checksum_errors_total

counter

The total number of errors identifying files via checksum.

file optional

The file that produced the error.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

component_discarded_events_total

counter

The number of events dropped by this component.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

host optional

The hostname of the system Vector is running on.

intentional

True if the events were discarded intentionally, like a filter transform, or false if due to an error.

pid optional

The process ID of the Vector instance.

component_errors_total

counter

The total number of errors encountered by this component.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

error_type

The type of the error.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

stage

The stage within the component at which the error occurred.

component_received_bytes_total

counter

The number of raw bytes accepted by this component from source origins.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

container_name optional

The name of the container from which the data originated.

file optional

The file from which the data originated.

host optional

The hostname of the system Vector is running on.

mode optional

The connection mode used by the component.

peer_addr optional

The IP from which the data originated.

peer_path optional

The pathname from which the data originated.

pid optional

The process ID of the Vector instance.

pod_name optional

The name of the pod from which the data originated.

uri optional

The sanitized URI from which the data originated.

component_received_event_bytes_total

counter

The number of event bytes accepted by this component either from tagged origins like file and uri, or cumulatively from other origins.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

container_name optional

The name of the container from which the data originated.

file optional

The file from which the data originated.

host optional

The hostname of the system Vector is running on.

mode optional

The connection mode used by the component.

peer_addr optional

The IP from which the data originated.

peer_path optional

The pathname from which the data originated.

pid optional

The process ID of the Vector instance.

pod_name optional

The name of the pod from which the data originated.

uri optional

The sanitized URI from which the data originated.

component_received_events_count

histogram

A histogram of the number of events passed in each internal batch in Vector’s internal topology.

Note that this is separate from sink-level batching. It is mostly useful for low-level debugging of performance issues in Vector caused by small internal batches.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

container_name optional

The name of the container from which the data originated.

file optional

The file from which the data originated.

host optional

The hostname of the system Vector is running on.

mode optional

The connection mode used by the component.

peer_addr optional

The IP from which the data originated.

peer_path optional

The pathname from which the data originated.

pid optional

The process ID of the Vector instance.

pod_name optional

The name of the pod from which the data originated.

uri optional

The sanitized URI from which the data originated.

component_received_events_total

counter

The number of events accepted by this component either from tagged origins like file and uri, or cumulatively from other origins.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

container_name optional

The name of the container from which the data originated.

file optional

The file from which the data originated.

host optional

The hostname of the system Vector is running on.

mode optional

The connection mode used by the component.

peer_addr optional

The IP from which the data originated.

peer_path optional

The pathname from which the data originated.

pid optional

The process ID of the Vector instance.

pod_name optional

The name of the pod from which the data originated.

uri optional

The sanitized URI from which the data originated.

component_sent_event_bytes_total

counter

The total number of event bytes emitted by this component.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

host optional

The hostname of the system Vector is running on.

output optional

The specific output of the component.

pid optional

The process ID of the Vector instance.

component_sent_events_total

counter

The total number of events emitted by this component.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

host optional

The hostname of the system Vector is running on.

output optional

The specific output of the component.

pid optional

The process ID of the Vector instance.

files_added_total

counter

The total number of files Vector has found to watch.

file optional

The file the metric applies to.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

files_deleted_total

counter

The total number of files deleted.

file optional

The file the metric applies to.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

files_resumed_total

counter

The total number of times Vector has resumed watching a file.

file optional

The file the metric applies to.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

files_unwatched_total

counter

The total number of times Vector has stopped watching a file.

file optional

The file the metric applies to.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

open_files

counter

The total number of open files.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

source_lag_time_seconds

histogram

The difference between the timestamp recorded in each event and the time when it was ingested, expressed as fractional seconds.

component_id

The Vector component ID.

component_kind

The Vector component kind.

component_type

The Vector component type.

host optional

The hostname of the system Vector is running on.

pid optional

The process ID of the Vector instance.

Examples

Apache Access Log

Given this event...

53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308

...and this configuration...

sources:
  my_source_id:
    type: file
    include:
      - /var/log/**/*.log

[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]

{
  "sources": {
    "my_source_id": {
      "type": "file",
      "include": [
        "/var/log/**/*.log"
      ]
    }
  }
}

...this Vector event is produced:

{
  "file": "/var/log/apache/access.log",
  "host": "my-host.local",
  "message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
  "source_type": "file",
  "timestamp": "2020-10-10T17:07:36.452332Z"
}

How it works

Autodiscovery

Vector will continually look for new files matching any of your include patterns. The frequency is controlled via the glob_minimum_cooldown_ms option. If a new file is added that matches any of the supplied patterns, Vector will begin tailing it. Vector maintains a unique list of files and will not tail a file more than once, even if it matches multiple patterns. You can read more about how we identify files in the Fingerprinting section.

Checkpointing

Vector checkpoints the current read position after each successful read. This ensures that Vector resumes where it left off if restarted, preventing data from being read twice. The checkpoint positions are stored in the data directory which is specified via the global data_dir option, but can be overridden via the data_dir option in the file source directly.
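As a minimal sketch of the override described above, the following configuration sets a global data_dir and a different one for a single file source. The global path is an assumption chosen for illustration; the per-source path is the example value used earlier on this page.

```toml
# Global data directory (assumed path), used for checkpoints by default.
data_dir = "/var/lib/vector"

[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]
# Per-source override: checkpoints for this source are kept here instead.
data_dir = "/var/local/lib/vector/"
```

In either case, the user running Vector needs write permission to the directory.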

Compressed Files

Vector will transparently detect files which have been compressed using Gzip and decompress them for reading. This detection process looks for the unique sequence of bytes in the Gzip header and does not rely on the compressed files adhering to any kind of naming convention.

One caveat with reading compressed files is that Vector is not able to efficiently seek into them. Rather than implement a potentially-expensive full scan as a seek mechanism, Vector currently will not attempt to make further reads from a file for which it has already stored a checkpoint in a previous run. For this reason, users should take care to allow Vector to fully process any compressed files before shutting the process down or moving the files to another location on disk.

Context

By default, the file source augments events with helpful context keys.
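The names of these context keys can be changed with the file_key, host_key, and offset_key options described in the Configuration section. The sketch below is illustrative only; the key names are example values, not recommendations.

```toml
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]
# Store the originating file path under "path" instead of "file".
file_key = "path"
# Store the hostname under "hostname".
host_key = "hostname"
# The byte offset is only added when offset_key is set.
offset_key = "offset"
```

Setting file_key or host_key to "" suppresses the corresponding key.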

File Deletion

When a watched file is deleted, Vector will maintain its open file handle and continue reading until it reaches EOF. When a file can no longer be found via the include option and the reader has reached EOF, that file’s reader is discarded.

File Read Order

By default, Vector attempts to allocate its read bandwidth fairly across all of the files it’s currently watching. This prevents a single very busy file from starving other independent files from being read. In certain situations, however, this can lead to interleaved reads from files that should be read one after the other.

For example, consider a service that logs to timestamped files, creating a new one at an interval and leaving the old one as-is. Under normal operation, Vector would follow writes as they happen to each file and there would be no interleaving. In an overload situation, however, Vector may pick up and begin tailing newer files before catching up to the latest writes from older files. This would cause writes from a single logical log stream to be interleaved in time and potentially slow down ingestion as a whole, since the fixed total read bandwidth is allocated across an increasing number of files.

To address this type of situation, Vector provides the oldest_first option. When set, Vector will not read from any file younger than the oldest file that it hasn’t yet caught up to. In other words, Vector will continue reading from older files as long as there is more data to read. Only once it hits the end will it then move on to read from younger files.

Whether or not to use the oldest_first flag depends on the organization of the logs you’re configuring Vector to tail. If your include option contains multiple independent logical log streams (e.g. Nginx’s access.log and error.log, or logs from multiple services), you are likely better off with the default behavior. If you’re dealing with a single logical log stream or if you value per-stream ordering over fairness across streams, consider setting the oldest_first option to true.
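A minimal sketch of the single-stream case, assuming a hypothetical service that writes timestamped files into one directory (the source name and path are illustrative):

```toml
[sources.app_logs]
type = "file"
# Hypothetical directory holding one logical log stream split across files.
include = [ "/var/log/my-app/*.log" ]
# Drain older files completely before reading newer ones.
oldest_first = true
```

Note that, as stated in the max_read_bytes description, max_read_bytes does not apply when oldest_first is true.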

File Rotation

Vector supports tailing across a number of file rotation strategies. The default behavior of logrotate is simply to move the old log file and create a new one. This requires no special configuration of Vector, as it will maintain its open file handle to the rotated log until it has finished reading and it will find the newly created file normally.

A popular alternative strategy is copytruncate, in which logrotate will copy the old log file to a new location before truncating the original. Vector will also handle this well out of the box, but there are a couple of configuration options that will help reduce the very small chance of missed data in some edge cases. We recommend a combination of delaycompress (if applicable) on the logrotate side and including the first rotated file in Vector’s include option. This allows Vector to find the file after rotation, read it uncompressed to identify it, and then ensure it has all of the data, including any written in a gap between Vector’s last read and the actual rotation event.
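On the Vector side, the copytruncate recommendation might look like the sketch below. It assumes a hypothetical log at /var/log/my-app/app.log and that logrotate names the first rotated copy app.log.1; adjust both to match your actual logrotate configuration.

```toml
[sources.app_logs]
type = "file"
include = [
  # The live log file (hypothetical path).
  "/var/log/my-app/app.log",
  # The first rotated copy, left uncompressed by delaycompress
  # (assumed naming scheme).
  "/var/log/my-app/app.log.1",
]
```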

Fingerprinting

By default, Vector identifies files by running a cyclic redundancy check (CRC) on the first N lines of the file. This serves as a fingerprint that uniquely identifies the file. The number of lines, N, can be set using the fingerprint.lines option, and a leading header can be skipped using the fingerprint.ignored_header_bytes option. Note that for compressed files, these lines and header bytes refer to the uncompressed content.

This strategy avoids the common pitfalls associated with using device and inode numbers, since inode numbers can be reused across files. This enables Vector to properly tail files across various rotation strategies.
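As an illustrative sketch, the fingerprint options could be combined as follows for files that all begin with the same header. The header size and line count are made-up values, not recommendations.

```toml
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]

[sources.my_source_id.fingerprint]
strategy = "checksum"
# Skip a header shared by all files (hypothetical size in bytes).
ignored_header_bytes = 128
# Number of lines to read for generating the checksum.
lines = 2
```

Keep in mind that a file with fewer lines than fingerprint.lines is not read at all, so larger values delay pickup of freshly created files.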

Globbing

Globbing is supported in all provided file paths. Files are autodiscovered continually at a rate defined by the glob_minimum_cooldown_ms option.
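A small sketch that combines the globbing-related options. The paths are the example values used earlier on this page; the cooldown value is an arbitrary illustration.

```toml
[sources.my_source_id]
type = "file"
# Recursively match every .log file under /var/log.
include = [ "/var/log/**/*.log" ]
# Filtered out after the include patterns have been expanded.
exclude = [ "/var/log/binary-file.log" ]
# Search for new files every 5 seconds instead of the default 1 second.
glob_minimum_cooldown_ms = 5_000
```

A longer cooldown reduces discovery overhead but raises the chance of missing short-lived files between searches.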

Line Delimiters

Each line is read until a newline delimiter (by default, \n, i.e. the 0xA byte) or EOF is found. If needed, the default line delimiter can be overridden via the line_delimiter option.
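For instance, files written with Windows-style line endings could be read with a sketch like the following (the source name and path are hypothetical):

```toml
[sources.windows_style_logs]
type = "file"
# Hypothetical path to files that use CRLF line endings.
include = [ "/var/log/my-app/*.log" ]
# Split lines on carriage return + line feed instead of the default "\n".
line_delimiter = "\r\n"
```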

Multiline Messages

Sometimes a single log event will appear as multiple log lines. To handle this, Vector provides a set of multiline options. These options were carefully thought through and will allow you to solve the simplest and most complex cases. Let’s look at a few examples:

Example 1: Ruby Exceptions

Ruby exceptions, when logged, consist of multiple lines:

foobar.rb:6:in `/': divided by 0 (ZeroDivisionError)
	from foobar.rb:6:in `bar'
	from foobar.rb:2:in `foo'
	from foobar.rb:9:in `<main>'

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
	type = "file"
	# ...

	[sources.my_file_source.multiline]
		start_pattern = '^[^\s]'
		mode = "continue_through"
		condition_pattern = '^[\s]+from'
		timeout_ms = 1000

start_pattern, set to ^[^\s], tells Vector that new multi-line events start with a line that does not begin with white-space.
mode, set to continue_through, tells Vector to continue aggregating lines until the condition_pattern no longer matches (excluding the non-matching line).
condition_pattern, set to ^[\s]+from, tells Vector to continue aggregating lines if they start with white-space followed by from.

Example 2: Line Continuations

Some programming languages use the backslash (\) character to signal that a line will continue on the next line:

First line\
second line\
third line

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
	type = "file"
	# ...

	[sources.my_file_source.multiline]
		start_pattern = '\\$'
		mode = "continue_past"
		condition_pattern = '\\$'
		timeout_ms = 1000

start_pattern, set to \\$, tells Vector that new multi-line events start with lines that end in \.
mode, set to continue_past, tells Vector to continue aggregating lines, plus one additional line, until condition_pattern no longer matches.
condition_pattern, set to \\$, tells Vector to continue aggregating lines if they end with a \ character.

Example 3: Timestamps

Activity logs from services such as Elasticsearch typically begin with a timestamp, followed by information on the specific activity, as in this example:

[2015-08-24 11:49:14,389][ INFO ][env                      ] [Letha] using [1] data paths, mounts [[/
(/dev/disk1)]], net usable_space [34.5gb], net total_space [118.9gb], types [hfs]

To consume these lines as a single event, use the following Vector configuration:

[sources.my_file_source]
type = "file"
# ...

[sources.my_file_source.multiline]
start_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
mode = "halt_before"
condition_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
timeout_ms = 1000

start_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector that new multi-line events start with a timestamp sequence.
mode, set to halt_before, tells Vector to continue aggregating lines as long as the condition_pattern does not match.
condition_pattern, set to ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}, tells Vector to continue aggregating up until a line starts with a timestamp sequence.
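The examples above cover continue_through, continue_past, and halt_before. For completeness, here is a minimal sketch of halt_with, assuming a hypothetical log of statements that may span several lines and end with a semicolon. The include path and both patterns are illustrative assumptions, not taken from an existing setup.

```toml
[sources.my_file_source]
type = "file"
# Hypothetical path, for illustration only.
include = [ "/var/log/my-app/statements.log" ]

[sources.my_file_source.multiline]
# A group starts with a line that does not yet end in a semicolon.
start_pattern = '[^;]$'
mode = "halt_with"
# Aggregate up to and including the first line that ends in a semicolon.
condition_pattern = ';$'
timeout_ms = 1000
```

Under this sketch, a statement that already ends with a semicolon on its first line does not match start_pattern and would be expected to pass through as a single-line event.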

File permissions

To be able to source events from the files, Vector must be able to read the files and execute their parent directories.

If you have deployed Vector using one of our distributed packages, then you will find Vector running as the vector user. You should ensure this user has read access to the files matched by include. Strategies for this include:

Create a new unix group, make it the group owner of the target files, with read access, and add vector to that group.
Use POSIX ACLs to grant access to the files to the vector user.
Grant the CAP_DAC_READ_SEARCH Linux capability. This capability bypasses the file system permission checks to allow Vector to read any file. This is not recommended as it gives Vector more permissions than it requires, but it is preferable to running Vector as root, which would grant it even broader permissions. This can be granted via systemd by creating an override file using systemctl edit vector and adding:
```
[Service]
AmbientCapabilities=CAP_DAC_READ_SEARCH
CapabilityBoundingSet=CAP_DAC_READ_SEARCH
```

On Debian-based distributions, the vector user is automatically added to the adm group, if it exists, which has permissions to read /var/log.

Read Position

By default, Vector will read from the beginning of newly discovered files. You can change this behavior by setting the read_from option to "end".

Previously discovered files will be checkpointed, and the read position will resume from the last checkpoint. To disable this behavior, you can set the ignore_checkpoints option to true. This will cause Vector to disregard existing checkpoints when determining the starting read position of a file.
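A sketch combining both options, using the example include pattern from earlier on this page:

```toml
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]
# Start at the current end of newly discovered files.
read_from = "end"
# Disregard stored checkpoints when choosing the starting position.
ignore_checkpoints = true
```

With this combination, lines written while Vector was not running would likely be skipped after a restart, so it is probably only appropriate when historical data is not needed. Checkpoints are still written normally.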

State

This component is stateless, meaning its behavior is consistent across each input.