AWS S3 Source

The Vector aws_s3 source collects logs from AWS S3.

Requirements

Configuration

[sources.my_source_id]
# General
type = "aws_s3" # required
region = "us-east-1" # required when endpoint = null
# SQS
sqs.delete_message = true # optional, default
sqs.poll_secs = 15 # optional, default, seconds
sqs.queue_url = "https://sqs.us-east-2.amazonaws.com/123456789012/MyQueue" # required
  • optionaltable

    auth

    Options for the authentication strategy.

    • optionalstring

      access_key_id

      The AWS access key id. Used for AWS authentication when communicating with AWS services. See AWS Authentication for more info.

      • Syntax: literal
    • optionalstring

      assume_role

      The ARN of an IAM role to assume at startup. See AWS Authentication for more info.

      • Syntax: literal
    • optionalstring

      secret_access_key

      The AWS secret access key. Used for AWS authentication when communicating with AWS services. See AWS Authentication for more info.

      • Syntax: literal
  • enumoptionalstring

    compression

    The compression format of the S3 objects.

    • Syntax: literal
    • Default: "auto"
    • Enum, must be one of: "auto" "gzip" "zstd" "none"
  • optionalstring

    endpoint

    Custom endpoint for use with AWS-compatible services. Providing a value for this option will make region moot.

    • Syntax: literal
    • Only relevant when: region = null
  • optionaltable

    multiline

    Multiline parsing configuration. If not specified, multiline parsing is disabled. See Handling events from the aws_s3 source for more info.

    • commonrequiredstring

      condition_pattern

      Condition regex pattern to look for. Exact behavior is configured via mode.

      This field accepts a valid Rust regular expression. Wrapping / characters are not required or permitted.

      • Syntax: regex
    • enumcommonrequiredstring

      mode

      Mode of operation, specifies how the condition_pattern is interpreted.

      • Syntax: literal
      • Enum, must be one of: "continue_through" "continue_past" "halt_before" "halt_with"
    • commonrequiredstring

      start_pattern

      Start regex pattern to look for as a beginning of the message.

      This field accepts a valid Rust regular expression. Wrapping / characters are not required or permitted.

      • Syntax: regex
    • commonrequireduint

      timeout_ms

      The maximum time to wait for the continuation. Once this timeout is reached, the buffered message is guaranteed to be flushed, even if incomplete.

  • commonrequired*string

    region

    The AWS region of the target service. If endpoint is provided it will override this value since the endpoint includes the region.

    • Syntax: literal
    • Only required when: endpoint = null
  • commonoptionaltable

    sqs

    SQS strategy options. Required if strategy=sqs.

    • commonoptionalbool

      delete_message

      Whether to delete the message once Vector processes it. It can be useful to set this to false to debug or during initial Vector setup.

      • Default: true
    • commonoptionaluint

      poll_secs

      How often to poll the queue for new messages in seconds.

      • Default: 15 (seconds)
    • commonrequiredstring

      queue_url

      The URL of the SQS queue to receive bucket notifications from.

      • Syntax: literal
    • optionaluint

      visibility_timeout_secs

      The visibility timeout to use for messages, in seconds. This controls how long a message is left unavailable after a Vector instance receives it. If that instance does not delete the message before the timeout expires, the message becomes available again to other consumers; this can happen if, for example, the Vector process crashes.

      • WARNING: Should be set higher than the length of time it takes to process an individual message to avoid that message being reprocessed.
      • Default: 300 (seconds)
  • enumoptionalstring

    strategy

    The strategy to use to consume objects from AWS S3.

    • Syntax: literal
    • Default: "sqs"
    • Enum, must be one of: "sqs"
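
As an illustration of how the sqs options above combine, the following is a minimal sketch of a source tuned for initial setup or debugging. The queue URL, account ID, and the chosen timeout are placeholder values, not recommendations:

```toml
[sources.my_source_id]
type = "aws_s3"
region = "us-east-1"
compression = "gzip" # assumes the objects are known to be gzip-compressed
strategy = "sqs"

# Placeholder queue URL; substitute the queue that receives the bucket notifications.
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"
# Leave messages on the queue so the same objects can be read again while testing.
sqs.delete_message = false
sqs.poll_secs = 15
# Keep an in-flight message hidden from other consumers for 10 minutes.
sqs.visibility_timeout_secs = 600
```

With delete_message = false, each message reappears on the queue once its visibility timeout expires, so the same objects are likely to be processed more than once. Set the option back to true after setup.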

Env Vars

  • commonoptionalstring

    AWS_ACCESS_KEY_ID

    The AWS access key id. Used for AWS authentication when communicating with AWS services. See AWS Authentication for more info.

    • Syntax: literal
  • commonoptionalstring

    AWS_CONFIG_FILE

    Specifies the location of the file that the AWS CLI uses to store configuration profiles.

    • Syntax: literal
    • Default: "~/.aws/config"
  • commonoptionalstring

    AWS_CREDENTIAL_EXPIRATION

    Expiration time in RFC 3339 format. If unset, credentials won't expire.

    • Syntax: literal
  • commonoptionalstring

    AWS_DEFAULT_REGION

    The default AWS region.

    • Syntax: literal
    • Only relevant when: endpoint = null
  • commonoptionalstring

    AWS_PROFILE

    Specifies the name of the CLI profile with the credentials and options to use. This can be the name of a profile stored in a credentials or config file.

    • Syntax: literal
    • Default: "default"
  • commonoptionalstring

    AWS_ROLE_SESSION_NAME

    Specifies a name to associate with the role session. This value appears in CloudTrail logs for commands performed by the user of this profile.

    • Syntax: literal
  • commonoptionalstring

    AWS_SECRET_ACCESS_KEY

    The AWS secret access key. Used for AWS authentication when communicating with AWS services. See AWS Authentication for more info.

    • Syntax: literal
  • commonoptionalstring

    AWS_SESSION_TOKEN

    The AWS session token. Used for AWS authentication when communicating with AWS services.

    • Syntax: literal
  • commonoptionalstring

    AWS_SHARED_CREDENTIALS_FILE

    Specifies the location of the file that the AWS CLI uses to store access keys.

    • Syntax: literal
    • Default: "~/.aws/credentials"

Output

This component outputs log events with the following fields:

{
"bucket" : "my-bucket",
"message" : "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
"object" : "AWSLogs/111111111111/vpcflowlogs/us-east-1/2020/10/26/111111111111_vpcflowlogs_us-east-1_fl-0c5605d9f1baf680d_20201026T1950Z_b1ea4a7a.log.gz",
"region" : "us-east-1",
"timestamp" : "2020-10-10T17:07:36+00:00"
}
  • commonrequiredstring

    bucket

    The bucket of the object the line came from.

    • Syntax: literal
  • commonrequiredstring

    message

    A line from the S3 object.

    • Syntax: literal
  • commonrequiredstring

    object

    The object the line came from.

    • Syntax: literal
  • commonrequiredstring

    region

    The AWS region the bucket is in.

    • Syntax: literal
  • commonrequiredtimestamp

    timestamp

    The Last-Modified time of the object. Defaults to the current timestamp if this information is missing.

Telemetry

This component provides the following metrics that can be retrieved through the internal_metrics source. See the metrics section in the monitoring page for more info.

  • counter

    events_in_total

    The number of events accepted by this component, either from tagged origins such as file and uri, or cumulatively from other origins. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • container_name - The name of the container from which the event originates.

    • file - The file from which the event originates.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

    • mode - The connection mode used by the component.

    • peer_addr - The IP from which the event originates.

    • peer_path - The pathname from which the event originates.

    • pod_name - The name of the pod from which the event originates.

    • uri - The sanitized uri from which the event originates.

  • counter

    processed_bytes_total

    The number of bytes processed by the component. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • container_name - The name of the container from which the bytes originate.

    • file - The file from which the bytes originate.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

    • mode - The connection mode used by the component.

    • peer_addr - The IP from which the bytes originate.

    • peer_path - The pathname from which the bytes originate.

    • pod_name - The name of the pod from which the bytes originate.

    • uri - The sanitized uri from which the bytes originate.

  • counter

    sqs_message_delete_failed_total

    The total number of failures to delete SQS messages. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_message_delete_succeeded_total

    The total number of successful deletions of SQS messages. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_message_processing_failed_total

    The total number of failures to process SQS messages. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_message_processing_succeeded_total

    The total number of SQS messages successfully processed. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_message_receive_failed_total

    The total number of failures to receive SQS messages. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_message_receive_succeeded_total

    The total number of times SQS messages were successfully received. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_message_received_messages_total

    The total number of received SQS messages. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    events_out_total

    The total number of events emitted by this component. This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

  • counter

    sqs_s3_event_record_ignored_total

    The total number of times an S3 record in an SQS message was ignored (for an event that was not ObjectCreated). This metric includes the following tags:

    • component_kind - The Vector component kind.

    • component_name - The Vector component ID.

    • component_type - The Vector component type.

    • ignore_type - The reason for ignoring the S3 record.

    • instance - The Vector instance identified by host and port.

    • job - The name of the job producing Vector metrics.

How It Works

AWS Authentication

Vector checks for AWS credentials in the following order:

  1. Options access_key_id and secret_access_key.
  2. Environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
  3. The credential_process command in the AWS config file (usually located at ~/.aws/config).
  4. The AWS credentials file (usually located at ~/.aws/credentials).
  5. The IAM instance profile (only works when running on an EC2 instance with an instance profile/role).

If credentials are not found, the health check will fail and an error will be logged.

Obtaining an access key

In general, we recommend using instance profiles/roles whenever possible. Where this is not possible, you can generate an AWS access key for any user within your AWS account; AWS provides a detailed guide on how to do this. Supply the resulting key through the access_key_id and secret_access_key options.
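
A minimal sketch of passing an access key through the auth table follows. The key values are the placeholder credentials used throughout the AWS documentation, and the queue URL is likewise a placeholder:

```toml
[sources.my_source_id]
type = "aws_s3"
region = "us-east-1"
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"

# Placeholder credentials; replace with a key generated for your own IAM user.
auth.access_key_id = "AKIAIOSFODNN7EXAMPLE"
auth.secret_access_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```

Because these options are checked first in the credential order listed above, they take precedence over the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.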

Assuming roles

Vector can assume an AWS IAM role via the assume_role option. This is an optional setting that is helpful for a variety of use cases, such as cross account access.
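
For the cross-account case, a configuration might look like the sketch below. The role ARN, account ID, and queue URL are hypothetical and only show the expected shape of the values:

```toml
[sources.my_source_id]
type = "aws_s3"
region = "us-east-1"
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"

# Hypothetical role ARN; Vector assumes this role at startup.
auth.assume_role = "arn:aws:iam::123456789012:role/my-cross-account-role"
```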

Context

By default, the aws_s3 source will augment events with helpful context keys as shown in the "Output" section.

Handling events from the aws_s3 source

This source behaves very similarly to the file source in that it will output one event per line (unless the multiline configuration option is used).
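
As a sketch of the multiline option, the configuration below merges indented continuation lines (a stack trace, for example) into the event started by the preceding unindented line. The patterns, the timeout, and the queue URL are illustrative assumptions. It also assumes that continue_through keeps appending lines for as long as they match condition_pattern:

```toml
[sources.my_source_id]
type = "aws_s3"
region = "us-east-1"
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/MyQueue"

[sources.my_source_id.multiline]
# A new message starts at any line that does not begin with whitespace.
start_pattern = '^[^\s]'
# Lines that begin with whitespace are treated as continuations of that message.
mode = "continue_through"
condition_pattern = '^[\s]+'
# Flush a buffered message after 1 second, even if it is incomplete.
timeout_ms = 1000
```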

You will commonly want to use transforms to parse the data. For example, to parse VPC flow logs sent to S3 you can chain the tokenizer transform:

[transforms.flow_logs]
type = "tokenizer" # required
inputs = ["s3"]
field_names = ["version", "account_id", "interface_id", "srcaddr", "dstaddr", "srcport", "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log_status"]
types.srcport = "int"
types.dstport = "int"
types.packets = "int"
types.bytes = "int"
types.start = "timestamp|%s"
types.end = "timestamp|%s"

To parse AWS load balancer logs, the regex_parser transform can be used:

[transforms.elasticloadbalancing_fields_parsed]
type = "regex_parser"
inputs = ["s3"]
regex = '''(?x)^
(?P<type>[\w]+)[ ]
(?P<timestamp>[\w:.-]+)[ ]
(?P<elb>[^\s]+)[ ]
(?P<client_host>[\d.:-]+)[ ]
(?P<target_host>[\d.:-]+)[ ]
(?P<request_processing_time>[\d.-]+)[ ]
(?P<target_processing_time>[\d.-]+)[ ]
(?P<response_processing_time>[\d.-]+)[ ]
(?P<elb_status_code>[\d-]+)[ ]
(?P<target_status_code>[\d-]+)[ ]
(?P<received_bytes>[\d-]+)[ ]
(?P<sent_bytes>[\d-]+)[ ]
"(?P<request_method>[\w-]+)[ ]
(?P<request_url>[^\s]+)[ ]
(?P<request_protocol>[^"\s]+)"[ ]
"(?P<user_agent>[^"]+)"[ ]
(?P<ssl_cipher>[^\s]+)[ ]
(?P<ssl_protocol>[^\s]+)[ ]
(?P<target_group_arn>[\w.:/-]+)[ ]
"(?P<trace_id>[^\s"]+)"[ ]
"(?P<domain_name>[^\s"]+)"[ ]
"(?P<chosen_cert_arn>[\w:./-]+)"[ ]
(?P<matched_rule_priority>[\d-]+)[ ]
(?P<request_creation_time>[\w.:-]+)[ ]
"(?P<actions_executed>[\w,-]+)"[ ]
"(?P<redirect_url>[^"]+)"[ ]
"(?P<error_reason>[^"]+)"'''
field = "message"
drop_failed = false
types.received_bytes = "int"
types.request_processing_time = "float"
types.sent_bytes = "int"
types.target_processing_time = "float"
types.response_processing_time = "float"
[transforms.elasticloadbalancing_url_parsed]
type = "regex_parser"
inputs = ["elasticloadbalancing_fields_parsed"]
regex = '^(?P<url_scheme>[\w]+)://(?P<url_hostname>[^\s:/?#]+)(?::(?P<request_port>[\d-]+))?-?(?:/(?P<url_path>[^\s?#]*))?(?P<request_url_query>\?[^\s#]+)?'
field = "request_url"
drop_failed = false

State

This component is stateless, meaning its behavior is consistent across each input.