(To be deprecated) External table
StarRocks supports access to other data sources by using external tables. External tables are created based on data tables that are stored in other data sources. StarRocks only stores the metadata of the data tables. You can use external tables to directly query data in other data sources. StarRocks supports the following data sources: MySQL, StarRocks, Elasticsearch, Apache Hive™, Apache Iceberg, and Apache Hudi. Currently, a StarRocks external table can only be used to write data from the current StarRocks cluster into another StarRocks cluster; it cannot be used to read data from it. For data sources other than StarRocks, you can only read data from these data sources.
NOTICE
- From v3.0 onwards, we recommend that you use catalogs to query data from Hive, Iceberg, and Hudi. See Hive catalog, Iceberg catalog, and Hudi catalog.
- From v3.1 onwards, we recommend that you use JDBC catalog to query data from MySQL and PostgreSQL and use Elasticsearch catalog to query data from Elasticsearch.
From v2.5 onwards, StarRocks provides the Data Cache feature, which accelerates queries on hot data in external data sources. For more information, see Data Cache.
StarRocks external table
From StarRocks 1.19 onwards, StarRocks allows you to use a StarRocks external table to write data from one StarRocks cluster to another. This achieves read-write separation and provides better resource isolation. You can first create a destination table in the destination StarRocks cluster. Then, in the source StarRocks cluster, you can create a StarRocks external table that has the same schema as the destination table and specify the information of the destination cluster and table in the PROPERTIES field.
Data can be written from a source cluster to a destination cluster by using the INSERT INTO statement to write into a StarRocks external table. This helps realize the following goals:
- Data synchronization between StarRocks clusters.
- Read-write separation. Data is written to the source cluster, and data changes from the source cluster are synchronized to the destination cluster, which provides query services.
The following code shows how to create a destination table and an external table.
# Create a destination table in the destination StarRocks cluster.
CREATE TABLE t
(
k1 DATE,
k2 INT,
k3 SMALLINT,
k4 VARCHAR(2048),
k5 DATETIME
)
ENGINE=olap
DISTRIBUTED BY HASH(k1);
# Create an external table in the source StarRocks cluster.
CREATE EXTERNAL TABLE external_t
(
k1 DATE,
k2 INT,
k3 SMALLINT,
k4 VARCHAR(2048),
k5 DATETIME
)
ENGINE=olap
DISTRIBUTED BY HASH(k1)
PROPERTIES
(
"host" = "127.0.0.1",
"port" = "9020",
"user" = "user",
"password" = "passwd",
"database" = "db_test",
"table" = "t"
);
# Write data from a source cluster to a destination cluster by writing data into the StarRocks external table. The second statement is recommended for the production environment.
insert into external_t values ('2020-10-11', 1, 1, 'hello', '2020-10-11 10:00:00');
insert into external_t select * from other_table;
Parameters:
- EXTERNAL: This keyword indicates that the table to be created is an external table.
- host: This parameter specifies the IP address of the leader FE node of the destination StarRocks cluster.
- port: This parameter specifies the RPC port of the leader FE node of the destination StarRocks cluster. You can set this parameter based on the rpc_port configuration in the fe/fe.conf file.
- user: This parameter specifies the username used to access the destination StarRocks cluster.
- password: This parameter specifies the password used to access the destination StarRocks cluster.
- database: This parameter specifies the database to which the destination table belongs.
- table: This parameter specifies the name of the destination table.
The following limits apply when you use a StarRocks external table:
- You can only run the INSERT INTO and SHOW CREATE TABLE commands on a StarRocks external table. Other data writing methods are not supported. In addition, you cannot query data from a StarRocks external table or perform DDL operations on the external table.
- The syntax of creating an external table is the same as that for creating a normal table, but the column names and other information in the external table must be the same as those in the destination table.
- The external table synchronizes table metadata from the destination table every 10 seconds. If a DDL operation is performed on the destination table, there may be a delay for data synchronization between the two tables.
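For example, because SHOW CREATE TABLE is one of the two supported commands, you can use it to verify the definition of the external table. A minimal sketch against the external_t table created above:
# Inspect the definition of the StarRocks external table.
# SHOW CREATE TABLE and INSERT INTO are the only commands supported on it.
SHOW CREATE TABLE external_t;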
External table for a JDBC-compatible database
From v2.3.0, StarRocks provides external tables to query JDBC-compatible databases. This way, you can analyze the data of such databases in a blazing-fast manner without the need to import the data into StarRocks. This section describes how to create an external table in StarRocks and query data in JDBC-compatible databases.
Prerequisites
Before you use a JDBC external table to query data, make sure that the FEs and BEs have access to the download URL of the JDBC driver. The download URL is specified by the driver_url parameter in the statement used for creating the JDBC resource.
Create and manage JDBC resources
Create a JDBC resource
Before you create an external table to query data from a database, you need to create a JDBC resource in StarRocks to manage the connection information of the database. The database must support the JDBC driver and is referred to as the "target database". After creating the resource, you can use it to create external tables.
Execute the following statement to create a JDBC resource named jdbc0:
create external resource jdbc0
properties (
"type"="jdbc",
"user"="postgres",
"password"="changeme",
"jdbc_uri"="jdbc:postgresql://127.0.0.1:5432/jdbc_test",
"driver_url"="https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.3/postgresql-42.3.3.jar",
"driver_class"="org.postgresql.Driver"
);
The required parameters in properties are as follows:

- type: the type of the resource. Set the value to jdbc.
- user: the username that is used to connect to the target database.
- password: the password that is used to connect to the target database.
- jdbc_uri: the URI that the JDBC driver uses to connect to the target database. The URI format must satisfy the database URI syntax. For the URI syntax of some common databases, visit the official websites of MySQL, Oracle, PostgreSQL, and SQL Server. Note: The URI must include the name of the target database. For example, in the preceding code example, jdbc_test is the name of the target database that you want to connect to.
- driver_url: the download URL of the JDBC driver JAR package. An HTTP URL or file URL is supported, for example, https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.3/postgresql-42.3.3.jar or file:///home/disk1/postgresql-42.3.3.jar.
- driver_class: the class name of the JDBC driver. The JDBC driver class names of common databases are as follows:
  - MySQL: com.mysql.jdbc.Driver (MySQL 5.x and earlier), com.mysql.cj.jdbc.Driver (MySQL 6.x and later)
  - SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver
  - Oracle: oracle.jdbc.driver.OracleDriver
  - PostgreSQL: org.postgresql.Driver
When the resource is being created, the FE downloads the JDBC driver JAR package by using the URL that is specified in the driver_url parameter, generates a checksum, and uses the checksum to verify the JDBC driver downloaded by BEs.

Note: If the download of the JDBC driver JAR package fails, the creation of the resource also fails.

When BEs query the JDBC external table for the first time and find that the corresponding JDBC driver JAR package does not exist on their machines, BEs download the JDBC driver JAR package by using the URL that is specified in the driver_url parameter, and all JDBC driver JAR packages are saved in the ${STARROCKS_HOME}/lib/jdbc_drivers directory.
View JDBC resources
Execute the following statement to view all JDBC resources in StarRocks:
SHOW RESOURCES;
Note: The ResourceType column is jdbc.
Delete a JDBC resource
Execute the following statement to delete the JDBC resource named jdbc0:
DROP RESOURCE "jdbc0";
Note: After a JDBC resource is deleted, all JDBC external tables that are created by using that JDBC resource are unavailable. However, the data in the target database is not lost. If you still need to use StarRocks to query data in the target database, you can create the JDBC resource and the JDBC external tables again.
Create a database
Execute the following statement to create and access a database named jdbc_test in StarRocks:
CREATE DATABASE jdbc_test;
USE jdbc_test;
Note: The database name that you specify in the preceding statement does not need to be the same as the name of the target database.
Create a JDBC external table
Execute the following statement to create a JDBC external table named jdbc_tbl in the database jdbc_test:
create external table jdbc_tbl (
`id` bigint NULL,
`data` varchar(200) NULL
) ENGINE=jdbc
properties (
"resource" = "jdbc0",
"table" = "dest_tbl"
);
The required parameters in properties are as follows:

- resource: the name of the JDBC resource used to create the external table.
- table: the target table name in the database.
For supported data types and data type mapping between StarRocks and target databases, see Data type mapping.
Note:
- Indexes are not supported.
- You cannot use PARTITION BY or DISTRIBUTED BY to specify data distribution rules.
Query a JDBC external table
Before you query JDBC external tables, you must execute the following statement to enable the Pipeline engine:
set enable_pipeline_engine=true;
Note: If the Pipeline engine is already enabled, you can skip this step.
Execute the following statement to query the data in the target database by using JDBC external tables.
select * from jdbc_tbl;
StarRocks supports predicate pushdown by pushing down filter conditions to the target table. Executing filter conditions as close as possible to the data source can improve query performance. Currently, StarRocks can push down the binary comparison operators (>, >=, =, <, and <=), IN, IS NULL, and BETWEEN ... AND .... However, StarRocks cannot push down functions.
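For example, the following sketch (using the jdbc_tbl table created above) shows a filter that can be pushed down and one that cannot; the lower() call is only an illustration of a function that prevents pushdown:
# The BETWEEN ... AND ... and IN filters can be pushed down to the target database.
select id, data from jdbc_tbl where id between 100 and 200 and data in ('a', 'b');
# A filter wrapped in a function cannot be pushed down; StarRocks fetches the rows and evaluates the filter itself.
select id, data from jdbc_tbl where lower(data) = 'a';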
Data type mapping
Currently, StarRocks can only query data of basic types in the target database, such as NUMBER, STRING, TIME, and DATE. If the ranges of data values in the target database are not supported by StarRocks, the query reports an error.
The mapping between the target database and StarRocks varies based on the type of the target database.
MySQL and StarRocks
MySQL | StarRocks |
---|---|
BOOLEAN | BOOLEAN |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
MEDIUMINT, INT | INT |
BIGINT | BIGINT |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
DECIMAL | DECIMAL |
CHAR | CHAR |
VARCHAR | VARCHAR |
DATE | DATE |
DATETIME | DATETIME |
Oracle and StarRocks
Oracle | StarRocks |
---|---|
CHAR | CHAR |
VARCHAR, VARCHAR2 | VARCHAR |
DATE | DATE |
SMALLINT | SMALLINT |
INT | INT |
BINARY_FLOAT | FLOAT |
BINARY_DOUBLE | DOUBLE |
DATETIME | DATETIME |
NUMBER | DECIMAL |
PostgreSQL and StarRocks
PostgreSQL | StarRocks |
---|---|
SMALLINT, SMALLSERIAL | SMALLINT |
INTEGER, SERIAL | INT |
BIGINT, BIGSERIAL | BIGINT |
BOOLEAN | BOOLEAN |
REAL | FLOAT |
DOUBLE PRECISION | DOUBLE |
DECIMAL | DECIMAL |
TIMESTAMP | DATETIME |
DATE | DATE |
CHAR | CHAR |
VARCHAR | VARCHAR |
TEXT | VARCHAR |
SQL Server and StarRocks
SQL Server | StarRocks |
---|---|
BOOLEAN | BOOLEAN |
TINYINT | TINYINT |
SMALLINT | SMALLINT |
INT | INT |
BIGINT | BIGINT |
FLOAT | FLOAT |
REAL | DOUBLE |
DECIMAL, NUMERIC | DECIMAL |
CHAR | CHAR |
VARCHAR | VARCHAR |
DATE | DATE |
DATETIME, DATETIME2 | DATETIME |
Limits
- When you create JDBC external tables, you cannot create indexes on the tables or use PARTITION BY and DISTRIBUTED BY to specify data distribution rules for the tables.
- When you query JDBC external tables, StarRocks cannot push down functions to the tables.
(Deprecated) Elasticsearch external table
StarRocks and Elasticsearch are two popular analytics systems. StarRocks is performant in large-scale distributed computing. Elasticsearch is ideal for full-text search. StarRocks combined with Elasticsearch can deliver a more complete OLAP solution.
Example of creating an Elasticsearch external table
Syntax
CREATE EXTERNAL TABLE elastic_search_external_table
(
k1 DATE,
k2 INT,
k3 SMALLINT,
k4 VARCHAR(2048),
k5 DATETIME
)
ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:9200,http://192.168.0.2:9200",
"user" = "root",
"password" = "root",
"index" = "tindex",
"type" = "_doc",
"es.net.ssl" = "true"
);
The following table describes the parameters.
Parameter | Required | Default value | Description |
---|---|---|---|
hosts | Yes | None | The connection address of the Elasticsearch cluster. You can specify one or more addresses. StarRocks can parse the Elasticsearch version and index shard allocation from this address. StarRocks communicates with your Elasticsearch cluster based on the address returned by the GET /_nodes/http API operation. |
index | Yes | None | The name of the Elasticsearch index that is created on the table in StarRocks. The name can be an alias. This parameter supports wildcards (*). For example, if you set index to hello*, StarRocks retrieves all indexes whose names start with hello. |
user | No | Empty | The username that is used to log in to the Elasticsearch cluster with basic authentication enabled. Make sure you have access to paths such as /_cluster/state/ and _nodes/http and have read privileges on the index. |
password | No | Empty | The password that is used to log in to the Elasticsearch cluster. |
type | No | _doc | The type of the index. Default value: _doc. |
es.nodes.wan.only | No | false | Specifies whether StarRocks only uses the addresses specified by hosts to access the Elasticsearch cluster and fetch data. If StarRocks cannot access the addresses of the data nodes inside the Elasticsearch cluster, set this parameter to true. |
es.net.ssl | No | false | Specifies whether the HTTPS protocol can be used to access your Elasticsearch cluster. Only StarRocks 2.4 and later versions support configuring this parameter. |
enable_docvalue_scan | No | true | Specifies whether to obtain the values of the target fields from Elasticsearch columnar storage. In most cases, reading data from columnar storage outperforms reading data from row storage. |
enable_keyword_sniff | No | true | Specifies whether to sniff TEXT-type fields in Elasticsearch based on KEYWORD-type fields. If this parameter is set to false, StarRocks performs matching after tokenization. |
Columnar scan for faster queries
If you set enable_docvalue_scan to true, StarRocks follows these rules when it obtains data from Elasticsearch:
- Try and see: StarRocks automatically checks if columnar storage is enabled for the target fields. If so, StarRocks obtains all values in the target fields from columnar storage.
- Auto-downgrading: If any one of the target fields is unavailable in columnar storage, StarRocks parses and obtains all values in the target fields from row storage (_source).
NOTE

- Columnar storage is unavailable for TEXT-type fields in Elasticsearch. Therefore, if you query fields containing TEXT-type values, StarRocks obtains the values of the fields from _source.
- If you query a large number (greater than or equal to 25) of fields, reading field values from docvalue does not show noticeable benefits compared with reading field values from _source.
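For example, with the elastic_search_external_table defined earlier, a query that reads only non-TEXT fields can be served from columnar storage, while a query on a TEXT-type field falls back to _source. A hedged sketch, assuming k4 maps to a TEXT-type field as in the sniffing example below:
# k1, k2, and k3 have columnar storage, so their values are read from docvalue.
select k1, k2, k3 from elastic_search_external_table;
# k4 maps to a TEXT-type field without columnar storage, so StarRocks parses _source.
select k4 from elastic_search_external_table;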
Sniff KEYWORD-type fields
If you set enable_keyword_sniff to true, Elasticsearch allows direct data ingestion without an index because it will automatically create an index after ingestion. For STRING-type fields, Elasticsearch will create a field with both TEXT and KEYWORD types. This is how the Multi-Field feature of Elasticsearch works. The mapping is as follows:
"k4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
For example, to conduct "=" filtering on k4, StarRocks converts the filtering operation into an Elasticsearch TermQuery.
The original SQL filter is as follows:
k4 = "StarRocks On Elasticsearch"
The converted Elasticsearch query DSL is as follows:
"term" : {
"k4": "StarRocks On Elasticsearch"
}
The first field of k4 is TEXT, and it will be tokenized by the analyzer configured for k4 (or by the standard analyzer if no analyzer has been configured for k4) after data ingestion. As a result, the first field will be tokenized into three terms: StarRocks, On, and Elasticsearch. The details are as follows:
POST /_analyze
{
"analyzer": "standard",
"text": "StarRocks On Elasticsearch"
}
The tokenization results are as follows:
{
  "tokens": [
    {
      "token": "starrocks",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "on",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elasticsearch",
      "start_offset": 13,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
Suppose you conduct a query as follows:
"term" : {
"k4": "StarRocks On Elasticsearch"
}
There is no term in the dictionary that matches the term StarRocks On Elasticsearch, and therefore no result will be returned.
However, if you have set enable_keyword_sniff to true, StarRocks will convert k4 = "StarRocks On Elasticsearch" to k4.keyword = "StarRocks On Elasticsearch" to match the SQL semantics. The converted Elasticsearch query DSL is as follows:
"term" : {
"k4.keyword": "StarRocks On Elasticsearch"
}
k4.keyword is of the KEYWORD type. Therefore, the data is written into Elasticsearch as a complete term, allowing for successful matching.
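In SQL terms, with enable_keyword_sniff set to true, the following query (a sketch against the elastic_search_external_table defined earlier) returns the matching rows because the filter is rewritten against k4.keyword:
# The exact-match filter is rewritten to k4.keyword = '...', which matches the complete, untokenized term.
select * from elastic_search_external_table where k4 = "StarRocks On Elasticsearch";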
Mapping of column data types
When you create an external table, you need to specify the data types of columns in the external table based on the data types of columns in the Elasticsearch table. The following table shows the mapping of column data types.
Elasticsearch | StarRocks |
---|---|
BOOLEAN | BOOLEAN |
BYTE | TINYINT/SMALLINT/INT/BIGINT |
SHORT | SMALLINT/INT/BIGINT |
INTEGER | INT/BIGINT |
LONG | BIGINT |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
KEYWORD | CHAR/VARCHAR |
TEXT | CHAR/VARCHAR |
DATE | DATE/DATETIME |
NESTED | CHAR/VARCHAR |
OBJECT | CHAR/VARCHAR |
ARRAY | ARRAY |
Note
- StarRocks reads the data of the NESTED type by using JSON-related functions.
- Elasticsearch automatically flattens multi-dimensional arrays into one-dimensional arrays. StarRocks does the same. The support for querying ARRAY data from Elasticsearch is added from v2.5.
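For example, if the index contains a NESTED field, you can map it to a VARCHAR column and extract values with a JSON function such as get_json_string. A hedged sketch; the user_info column and its inner fields are hypothetical:
# user_info is a hypothetical NESTED field mapped to a VARCHAR column.
# get_json_string extracts a nested value from the JSON text.
select get_json_string(user_info, '$.name') from es_table
where get_json_string(user_info, '$.city') = 'Beijing';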
Predicate pushdown
StarRocks supports predicate pushdown. Filters can be pushed down to Elasticsearch for execution, which improves query performance. The following table lists the operators that support predicate pushdown.
SQL syntax | ES syntax |
---|---|
= | term query |
in | terms query |
>, >=, <, <= | range |
and | bool.filter |
or | bool.should |
not | bool.must_not |
not in | bool.must_not + terms |
esquery | ES Query DSL |
Examples
The esquery function is used to push down queries that cannot be expressed in SQL (such as match and geoshape) to Elasticsearch for filtering. The first parameter in the esquery function is used to associate an index. The second parameter is a JSON expression of basic Query DSL, which is enclosed in braces {}. The JSON expression must have one and only one root key, such as match, geo_shape, or bool.
- match query
select * from es_table where esquery(k4, '{
"match": {
"k4": "StarRocks on elasticsearch"
}
}');
- geo-related query
select * from es_table where esquery(k4, '{
"geo_shape": {
"location": {
"shape": {
"type": "envelope",
"coordinates": [
[
13,
53
],
[
14,
52
]
]
},
"relation": "within"
}
}
}');
- bool query
select * from es_table where esquery(k4, ' {
"bool": {
"must": [
{
"terms": {
"k1": [
11,
12
]
}
},
{
"terms": {
"k2": [
100
]
}
}
]
}
}');
Usage notes
- Versions of Elasticsearch earlier than 5.x scan data in a different way than versions later than 5.x. Currently, only versions later than 5.x are supported.
- Elasticsearch clusters with HTTP basic authentication enabled are supported.
- Some queries, such as count-related queries, may not be as fast through StarRocks as they are directly against Elasticsearch. The reason is that Elasticsearch directly reads the metadata of the target documents without the need to filter the real data, which accelerates count queries.
(Deprecated) Hive external table
Before using Hive external tables, make sure JDK 1.8 has been installed on your servers.
Create a Hive resource
A Hive resource corresponds to a Hive cluster. You must configure the information about the Hive cluster used by StarRocks, such as the Hive metastore address. You must specify the Hive resource that is used by the Hive external table.
- Create a Hive resource named hive0.
CREATE EXTERNAL RESOURCE "hive0"
PROPERTIES (
"type" = "hive",
"hive.metastore.uris" = "thrift://10.10.44.98:9083"
);
- View the resources created in StarRocks.
SHOW RESOURCES;
- Delete the resource named hive0.
DROP RESOURCE "hive0";
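After the resource is created, you can reference it in a Hive external table. The following is a minimal sketch; the database and table names (hive_db, hive_tbl) are placeholders, and the column definitions must match the schema of the Hive table:
CREATE EXTERNAL TABLE hive_external_tbl
(
    k1 INT,
    k2 VARCHAR(100)
)
ENGINE=HIVE
PROPERTIES (
    "resource" = "hive0",
    "database" = "hive_db",
    "table" = "hive_tbl"
);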