Experiencing Hudi with Docker

Prerequisites

Use Hudi version 0.13.1.

Why this version? The Hudi Docker demo is built against Scala 2.11, but starting with Hudi 0.14 the scala.version property was bumped to 2.12. After several attempts, 0.14 kept failing with Scala version mismatch errors.

```
hudi  (7a654395) [ 🛤️ 1 ][📦 v0.13.1][ v1.8.0]
mvn clean package -Pintegration-tests -DskipTests
```

Note that the -Pintegration-tests profile must be included; the demo only runs correctly when it is.

After the build succeeds, bring up the demo environment:

```
docker/setup_demo.sh dev
```

With the demo running, run the following to test:

```
docker exec -it adhoc-1 /bin/bash
root@adhoc-1:/opt# $SPARK_INSTALL/bin/spark-shell \
>   --jars $HUDI_SPARK_BUNDLE \
>   --master local[2] \
>   --driver-class-path $HADOOP_CONF_DIR \
>   --conf spark.sql.hive.convertMetastoreParquet=false \
>   --deploy-mode client \
>   --driver-memory 1G \
>   --executor-memory 3G \
>   --num-executors 1
23/10/24 05:17:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-1:4040
Spark context available as 'sc' (master = local[2], app id = local-1698124644178).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>  spark.sql("show tables").show(100, false)
+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |stock_ticks_cow   |false      |
|default |stock_ticks_mor_ro|false      |
|default |stock_ticks_mor_rt|false      |
+--------+------------------+-----------+


scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
23/10/24 05:18:11 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/10/24 05:18:11 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20231024051251213  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20231024051251213  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+


scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala>  spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala>  spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20231024051414103  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20231024051414103  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20231024051414103  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20231024051414103  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+


scala> :quit
root@adhoc-1:/opt# exit
```

Running Trino fails

```
docker exec -it adhoc-2 trino --server trino-coordinator-1:8091
OCI runtime exec failed: exec failed: unable to start container process: exec: "trino": executable file not found in $PATH: unknown
```

This happens because the trino binary is not present in the container. Download the Trino CLI jar and make it executable:

```
root@trino-worker-1:/usr/local/trino-server-368/bin# wget https://repo1.maven.org/maven2/io/trino/trino-cli/359/trino-cli-359-executable.jar
--2023-10-24 05:30:17--  https://repo1.maven.org/maven2/io/trino/trino-cli/359/trino-cli-359-executable.jar
Resolving repo1.maven.org (repo1.maven.org)... 198.18.0.184, 2a04:4e42:a::209
Connecting to repo1.maven.org (repo1.maven.org)|198.18.0.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10137386 (9.7M) [application/java-archive]
Saving to: ‘trino-cli-359-executable.jar’

trino-cli-359-executable.jar  100%[================================================>]   9.67M  4.87MB/s    in 2.0s

2023-10-24 05:30:20 (4.87 MB/s) - ‘trino-cli-359-executable.jar’ saved [10137386/10137386]

root@trino-worker-1:/usr/local/trino-server-368/bin# mv trino-cli-359-executable.jar /usr/local/sbin/trino
root@trino-worker-1:/usr/local/trino-server-368/bin# chmod a+x /usr/local/sbin/trino
root@trino-worker-1:/usr/local/trino-server-368/bin# trino
trino>
trino>

root@trino-worker-1:/usr/local/trino-server-368/bin# trino --server trino-coordinator-1:8091
trino> use hive.default;
USE
trino:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20231024_053152_00003_ch5qd, FINISHED, 1 node
Splits: 25 total, 25 done (100.00%)
6.23 [197 rows, 442KB] [31 rows/s, 70.9KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20231024051251213   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20231024051251213   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20231024_053210_00004_ch5qd, FINISHED, 1 node
Splits: 9 total, 9 done (100.00%)
0.50 [197 rows, 450KB] [393 rows/s, 898KB/s]

trino:default>  select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20231024_053221_00005_ch5qd, FINISHED, 1 node
Splits: 25 total, 25 done (100.00%)
0.51 [197 rows, 442KB] [387 rows/s, 869KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20231024051414103   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20231024051414103   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20231024_053239_00006_ch5qd, FINISHED, 1 node
Splits: 9 total, 9 done (100.00%)
0.37 [197 rows, 450KB] [530 rows/s, 1.18MB/s]

trino:default>
```

Hudi brings core data warehouse and database functionality directly to the data lake. It provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion services, data clustering, compaction optimization, and concurrency control, while keeping the data in open file formats; that is, Hudi table data can live on file systems such as HDFS or Amazon S3.
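For instance, an upsert is just a DataFrame write with the upsert operation. A minimal sketch from spark-shell, assuming the Hudi Spark bundle is on the classpath; the table name, path, and rows below are made up for illustration:

```scala
// Run inside spark-shell with the Hudi bundle loaded; `spark` is the active session.
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Illustrative rows: `symbol` is the record key, and on duplicate keys the
// row with the larger precombine field (`ts`) wins.
val updates = Seq(("GOOG", "2018-08-31 10:59:00", 1234L)).toDF("symbol", "ts", "volume")

updates.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert"). // also: insert, bulk_insert, delete
  option("hoodie.datasource.write.recordkey.field", "symbol").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "stock_ticks_sketch").
  mode(SaveMode.Append).
  save("file:///tmp/hudi/stock_ticks_sketch")
```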

Part of why Hudi caught on so quickly and won over so many developers is that it is easy to use on any cloud platform, and its data can be accessed from any popular query engine (including Apache Spark, Flink, Presto, Trino, Hive, and others). Even more valuable is that Hudi's designers accounted for as many business scenarios and practical requirements as possible.

Starting from real business scenarios, the demands on a data lake platform fall into two broad categories, read-heavy and write-heavy, so Apache Hudi provides two table types (a minimal write sketch follows the list):

  • Copy On Write (COW) tables store data in a columnar file format (e.g. Parquet). Each write copies and rewrites the affected Parquet files, which suits read-heavy workloads.
  • Merge On Read (MOR) tables store data in a mix of columnar files (e.g. Parquet) and row-based files (e.g. Avro). Updates are written to row-based files and later compacted, synchronously or asynchronously, into new columnar files, which suits write-heavy workloads.
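Which of the two you get is controlled by a single write option. A minimal sketch with made-up names, writing the same rows once as COW and once as MOR:

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(("GOOG", "2018-08-31 09:59:00", 6330L)).toDF("symbol", "ts", "volume")

// COPY_ON_WRITE is the default; MERGE_ON_READ additionally writes row-based log files.
for (tableType <- Seq("COPY_ON_WRITE", "MERGE_ON_READ")) {
  df.write.format("hudi").
    option("hoodie.datasource.write.table.type", tableType).
    option("hoodie.datasource.write.recordkey.field", "symbol").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.table.name", s"ticks_${tableType.toLowerCase}").
    mode(SaveMode.Overwrite).
    save(s"file:///tmp/hudi/ticks_${tableType.toLowerCase}")
}
```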

Drilling down further, Hudi offers different query types over the two table types:

  • Snapshot Queries: query the latest snapshot of the table, i.e. all of the data
  • Incremental Queries: query only the data added or modified within a specified time range
  • Read Optimized Queries: for MOR tables, query only the data already in the Parquet files

Of these three, read optimized queries only apply to MOR tables (on a COW table they are pointless, since COW stores data exclusively in Parquet files anyway); the other two can be used on both COW and MOR tables.
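Through the Spark datasource, the query type is likewise one read option. A minimal sketch using the option keys as named in Hudi 0.13; the base path and instant time below are illustrative:

```scala
val basePath = "file:///tmp/hudi/ticks_merge_on_read"

// Snapshot query: the default; the latest merged view of the table.
val snapshot = spark.read.format("hudi").load(basePath)

// Incremental query: only records committed after the given instant time.
val incremental = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20231024051251213").
  load(basePath)

// Read optimized query (MOR only): reads just the columnar base files,
// skipping any log files that have not been compacted yet.
val readOptimized = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load(basePath)
```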

Upgrading the Docker setup to Hadoop 3

When running hive-sync, it fails with a Guava version mismatch. After much troubleshooting, the root cause turned out to be a Guava conflict introduced by hive-exec.

See the follow-up post: resolving the Guava conflict in the hive-exec package.