File external table

File external table is a special type of external table. It allows you to directly query Parquet and ORC data files in external storage systems without loading data into StarRocks. In addition, file external tables do not rely on a metastore. In the current version, StarRocks supports the following external storage systems: HDFS, Amazon S3, and other S3-compatible storage systems.

This feature is supported from StarRocks v2.5.

Limits

File external tables must be created in databases within the default_catalog. You can run SHOW CATALOGS to query catalogs created in the cluster.
Only Parquet and ORC data files are supported.
You can only use file external tables to query data in the target data file. Data write operations such as INSERT, DELETE, and DROP are not supported.

Prerequisites

Before you create a file external table, you must configure your StarRocks cluster so that StarRocks can access the external storage system where the target data file is stored. The configurations required for a file external table are the same as those required for a Hive catalog, except that you do not need to configure a metastore. See Hive catalog - Integration preparations for more information about configurations.

Create a database (Optional)

After connecting to your StarRocks cluster, you can create a file external table in an existing database or create a new database to manage file external tables. To query existing databases in the cluster, run SHOW DATABASES. Then you can run USE <db_name> to switch to the target database.

The syntax for creating a database is as follows.

CREATE DATABASE [IF NOT EXISTS] <db_name>

Create a file external table

After accessing the target database, you can create a file external table in this database.

Syntax

CREATE EXTERNAL TABLE <table_name>
(
    <col_name> <col_type> [NULL | NOT NULL] [COMMENT "<comment>"]
) 
ENGINE=file
COMMENT ["comment"]
PROPERTIES
(
    FileLayoutParams,
    StorageCredentialParams
)

Parameters

Parameter	Required	Description
table_name	Yes	The name of the file external table. The naming conventions are as follows: The name can contain letters, digits (0-9), and underscores (_). It must start with a letter. The name cannot exceed 64 characters in length.
col_name	Yes	The column name in the file external table. The column names in the file external table must be the same as those in the target data file but are not case-sensitive. The order of columns in the file external table can be different from that in the target data file.
col_type	Yes	The column type in the file external table. You need to specify this parameter based on the column type in the target data file. For more information, see Mapping of column types .
NULL \| NOT NULL	No	Whether the column in the file external table is allowed to be NULL. NULL: NULL is allowed. NOT NULL: NULL is not allowed. You must specify this modifier based on the following rules: If this parameter is not specified for the columns in the target data file, you can choose not to specify it for the columns in the file external table or specify NULL for the columns in the file external table. If NULL is specified for the columns in the target data file, you can choose not to specify this parameter for the columns in the file external table or specify NULL for the columns in the file external table. If NOT NULL is specified for the columns in the target data file, you must also specify NOT NULL for the columns in the file external table.
comment	No	The comment of column in the file external table.
ENGINE	Yes	The type of engine. Set the value to file.
comment	No	The description of the file external table.
PROPERTIES	Yes	`FileLayoutParams`: specifies the path and format of the target file. This property is required. `StorageCredentialParams`: specifies the authentication information required for accessing object storage systems. This property is required only for AWS S3 and other S3-compatible storage systems.

FileLayoutParams

A set of parameters for accessing the target data file.

"path" = "<file_path>",
"format" = "<file_format>"
"enable_recursive_listing" = "{ true | false }"

Parameter	Required	Description
path	Yes	The path of the data file. If the data file is stored in HDFS, the path format is `hdfs://<IP address of HDFS>:<port>/<path>`. The default port number is 8020. If you use the default port, you do not need to specify it. If the data file is stored in AWS S3 or other S3-compatible storage system, the path format is `s3://<bucket name>/<folder>/`. Note the following rules when you enter the path: If you want to access all files in a path, end this parameter with a slash (`/`), such as `hdfs://x.x.x.x/user/hive/warehouse/array2d_parq/data/`. When you run a query, StarRocks traverses all data files under the path. It does not traverse data files by using recursion. If you want to access a single file, enter a path that directly points to this file, such as `hdfs://x.x.x.x/user/hive/warehouse/array2d_parq/data`. When you run a query, StarRocks only scans this data file.
format	Yes	The format of the data file. Only Parquet and ORC are supported.
enable_recursive_listing	No	Specifies whether to recursively transverse all files under the current path. Default value: false.

Parameter

Required

Description

path

Yes

The path of the data file.

If the data file is stored in HDFS, the path format is hdfs://<IP address of HDFS>:<port>/<path>. The default port number is 8020. If you use the default port, you do not need to specify it.
If the data file is stored in AWS S3 or other S3-compatible storage system, the path format is s3://<bucket name>/<folder>/.

Note the following rules when you enter the path:

If you want to access all files in a path, end this parameter with a slash (/), such as hdfs://x.x.x.x/user/hive/warehouse/array2d_parq/data/. When you run a query, StarRocks traverses all data files under the path. It does not traverse data files by using recursion.
If you want to access a single file, enter a path that directly points to this file, such as hdfs://x.x.x.x/user/hive/warehouse/array2d_parq/data. When you run a query, StarRocks only scans this data file.

format

Yes

The format of the data file. Only Parquet and ORC are supported.

enable_recursive_listing

Specifies whether to recursively transverse all files under the current path. Default value: false.

StorageCredentialParams (Optional)

A set of parameters about how StarRocks integrates with the target storage system. This parameter set is optional.

You need to configure StorageCredentialParams only when the target storage system is AWS S3 or other S3-compatible storage.

For other storage systems, you can ignore StorageCredentialParams.

AWS S3

If you need to access a data file stored in AWS S3, configure the following authentication parameters in StorageCredentialParams.

If you choose the instance profile-based authentication method, configure StorageCredentialParams as follows:

"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<aws_s3_region>"

If you choose the assumed role-based authentication method, configure StorageCredentialParams as follows:

"aws.s3.use_instance_profile" = "true",
"aws.s3.iam_role_arn" = "<ARN of your assumed role>",
"aws.s3.region" = "<aws_s3_region>"

If you choose the IAM user-based authentication method, configure StorageCredentialParams as follows:

"aws.s3.use_instance_profile" = "false",
"aws.s3.access_key" = "<iam_user_access_key>",
"aws.s3.secret_key" = "<iam_user_secret_key>",
"aws.s3.region" = "<aws_s3_region>"

Parameter name	Required	Description
aws.s3.use_instance_profile	Yes	Specifies whether to enable the instance profile-based authentication method and the assumed role-based authentication method when you access AWS S3. Valid values: `true` and `false`. Default value: `false`.
aws.s3.iam_role_arn	Yes	The ARN of the IAM role that has privileges on your AWS S3 bucket. If you use the assumed role-based authentication method to access AWS S3, you must specify this parameter. Then, StarRocks will assume this role when it accesses the target data file.
aws.s3.region	Yes	The region in which your AWS S3 bucket resides. Example: us-west-1.
aws.s3.access_key	No	The access key of your IAM user. If you use the IAM user-based authentication method to access AWS S3, you must specify this parameter.
aws.s3.secret_key	No	The secret key of your IAM user. If you use the IAM user-based authentication method to access AWS S3, you must specify this parameter.

For information about how to choose an authentication method for accessing AWS S3 and how to configure an access control policy in the AWS IAM Console, see Authentication parameters for accessing AWS S3.

S3-compatible storage

If you need to access an S3-compatible storage system, such as MinIO, configure StorageCredentialParams as follows to ensure a successful integration:

"aws.s3.enable_ssl" = "false",
"aws.s3.enable_path_style_access" = "true",
"aws.s3.endpoint" = "<s3_endpoint>",
"aws.s3.access_key" = "<iam_user_access_key>",
"aws.s3.secret_key" = "<iam_user_secret_key>"

The following table describes the parameters you need to configure in StorageCredentialParams.

Parameter	Required	Description
aws.s3.enable_ssl	Yes	Specifies whether to enable SSL connection. Valid values: `true` and `false`. Default value: `true`.
aws.s3.enable_path_style_access	Yes	Specifies whether to enable path-style access. Valid values: `true` and `false`. Default value: `false`. For MinIO, you must set the value to `true`. Path-style URLs use the following format: `https://s3.<region_code>.amazonaws.com/<bucket_name>/<key_name>`. For example, if you create a bucket named `DOC-EXAMPLE-BUCKET1` in the US West (Oregon) Region, and you want to access the `alice.jpg` object in that bucket, you can use the following path-style URL: `https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1/alice.jpg`.
aws.s3.endpoint	Yes	The endpoint used to connect to an S3-compatible storage system instead of AWS S3.
aws.s3.access_key	Yes	The access key of your IAM user.
aws.s3.secret_key	Yes	The secret key of your IAM user.

Mapping of column types

The following table provides the mapping of column types between the target data file and the file external table.

Data file	File external table
INT/INTEGER	INT
BIGINT	BIGINT
TIMESTAMP	DATETIME. Note that TIMESTAMP is converted to DATETIME without a time zone based on the time zone setting of the current session and loses some of its precision.
STRING	STRING
VARCHAR	VARCHAR
CHAR	CHAR
DOUBLE	DOUBLE
FLOAT	FLOAT
DECIMAL	DECIMAL
BOOLEAN	BOOLEAN
ARRAY	ARRAY
MAP	MAP
STRUCT	STRUCT

Examples

HDFS

Create a file external table named t0 to query Parquet data files stored in an HDFS path.

USE db_example;

CREATE EXTERNAL TABLE t0
(
    name string, 
    id int
) 
ENGINE=file
PROPERTIES 
(
    "path"="hdfs://x.x.x.x:8020/user/hive/warehouse/person_parq/", 
    "format"="parquet"
);

AWS S3

Example 1: Create a file external table and use instance profile to access a single Parquet file in AWS S3.

USE db_example;

CREATE EXTERNAL TABLE table_1
(
    name string, 
    id int
) 
ENGINE=file
PROPERTIES 
(
    "path" = "s3://bucket-test/folder1/raw_0.parquet", 
    "format" = "parquet",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.region" = "us-west-2" 
);

Example 2: Create a file external table and use assumed role to access all the ORC files under the target file path in AWS S3.

USE db_example;

CREATE EXTERNAL TABLE table_1
(
    name string, 
    id int
) 
ENGINE=file
PROPERTIES 
(
    "path" = "s3://bucket-test/folder1/", 
    "format" = "orc",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "arn:aws:iam::51234343412:role/role_name_in_aws_iam",
    "aws.s3.region" = "us-west-2" 
);

Example 3: Create a file external table and use IAM user to access all the ORC files under the file path in AWS S3.

USE db_example;

CREATE EXTERNAL TABLE table_1
(
    name string, 
    id int
) 
ENGINE=file
PROPERTIES 
(
    "path" = "s3://bucket-test/folder1/", 
    "format" = "orc",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "<iam_user_access_key>",
    "aws.s3.secret_key" = "<iam_user_access_key>",
    "aws.s3.region" = "us-west-2" 
);

Query a file external table

Syntax:

SELECT <clause> FROM <file_external_table>

For example, to query data from the file external table t0 created in Examples - HDFS, run the following command:

SELECT * FROM t0;

+--------+------+
| name   | id   |
+--------+------+
| jack   |    2 |
| lily   |    1 |
+--------+------+
2 rows in set (0.08 sec)

Manage file external tables

You can view the schema of the table using DESC or drop the table using DROP TABLE.

File external table

Limits​

Prerequisites​

Create a database (Optional)​

Create a file external table​

Syntax​

Parameters​

FileLayoutParams​

StorageCredentialParams (Optional)​

AWS S3​

S3-compatible storage​

Mapping of column types​

Examples​

HDFS​

AWS S3​

Query a file external table​

Manage file external tables​

Limits

Prerequisites

Create a database (Optional)

Create a file external table

Syntax

Parameters

FileLayoutParams

StorageCredentialParams (Optional)

AWS S3

S3-compatible storage

Mapping of column types

Examples

HDFS

AWS S3

Query a file external table

Manage file external tables