Experiencing Hudi with Docker

Prerequisites

Use Hudi version 0.13.1.

Why this version? The Hudi Docker demo is built against Scala 2.11, but starting with Hudi 0.14 the scala.version property was bumped to 2.12. After several attempts, 0.14 kept failing with Scala version mismatch errors.

```
hudi  (7a654395) [ 🛤️ 1 ][📦 v0.13.1][ v1.8.0]
mvn clean package -Pintegration-tests -DskipTests
```

Note that the -Pintegration-tests profile must be included; the demo only runs correctly when it is.

After the build succeeds, bring up the demo environment:

```
docker/setup_demo.sh dev
```

With the demo running, run the following to test:

```
docker exec -it adhoc-1 /bin/bash
root@adhoc-1:/opt# $SPARK_INSTALL/bin/spark-shell \
>   --jars $HUDI_SPARK_BUNDLE \
>   --master local[2] \
>   --driver-class-path $HADOOP_CONF_DIR \
>   --conf spark.sql.hive.convertMetastoreParquet=false \
>   --deploy-mode client \
>   --driver-memory 1G \
>   --executor-memory 3G \
>   --num-executors 1
23/10/24 05:17:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://adhoc-1:4040
Spark context available as 'sc' (master = local[2], app id = local-1698124644178).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala>  spark.sql("show tables").show(100, false)
+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |stock_ticks_cow   |false      |
|default |stock_ticks_mor_ro|false      |
|default |stock_ticks_mor_rt|false      |
+--------+------------------+-----------+


scala> spark.sql("select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG'").show(100, false)
23/10/24 05:18:11 WARN config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/10/24 05:18:11 WARN config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20231024051251213  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20231024051251213  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+


scala> spark.sql("select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala>  spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
+------+-------------------+
|symbol|max(ts)            |
+------+-------------------+
|GOOG  |2018-08-31 10:29:00|
+------+-------------------+


scala>  spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20231024051414103  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20231024051414103  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+


scala> spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
+-------------------+------+-------------------+------+---------+--------+
|_hoodie_commit_time|symbol|ts                 |volume|open     |close   |
+-------------------+------+-------------------+------+---------+--------+
|20231024051414103  |GOOG  |2018-08-31 09:59:00|6330  |1230.5   |1230.02 |
|20231024051414103  |GOOG  |2018-08-31 10:29:00|3391  |1230.1899|1230.085|
+-------------------+------+-------------------+------+---------+--------+


scala> :quit
root@adhoc-1:/opt# exit
```

Running Trino fails

```
docker exec -it adhoc-2 trino --server trino-coordinator-1:8091
OCI runtime exec failed: exec failed: unable to start container process: exec: "trino": executable file not found in $PATH: unknown
```

This happens because the trino binary is not present in the container. Download the Trino CLI jar and make it executable:

```
root@trino-worker-1:/usr/local/trino-server-368/bin# wget https://repo1.maven.org/maven2/io/trino/trino-cli/359/trino-cli-359-executable.jar
--2023-10-24 05:30:17--  https://repo1.maven.org/maven2/io/trino/trino-cli/359/trino-cli-359-executable.jar
Resolving repo1.maven.org (repo1.maven.org)... 198.18.0.184, 2a04:4e42:a::209
Connecting to repo1.maven.org (repo1.maven.org)|198.18.0.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10137386 (9.7M) [application/java-archive]
Saving to: ‘trino-cli-359-executable.jar’

trino-cli-359-executable.jar  100%[================================================>]   9.67M  4.87MB/s    in 2.0s

2023-10-24 05:30:20 (4.87 MB/s) - ‘trino-cli-359-executable.jar’ saved [10137386/10137386]

root@trino-worker-1:/usr/local/trino-server-368/bin# mv trino-cli-359-executable.jar /usr/local/sbin/trino
root@trino-worker-1:/usr/local/trino-server-368/bin# chmod a+x /usr/local/sbin/trino
root@trino-worker-1:/usr/local/trino-server-368/bin# trino
trino>
trino>

root@trino-worker-1:/usr/local/trino-server-368/bin# trino --server trino-coordinator-1:8091
trino> use hive.default;
USE
trino:default> select symbol, max(ts) from stock_ticks_cow group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20231024_053152_00003_ch5qd, FINISHED, 1 node
Splits: 25 total, 25 done (100.00%)
6.23 [197 rows, 442KB] [31 rows/s, 70.9KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_cow where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20231024051251213   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20231024051251213   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20231024_053210_00004_ch5qd, FINISHED, 1 node
Splits: 9 total, 9 done (100.00%)
0.50 [197 rows, 450KB] [393 rows/s, 898KB/s]

trino:default>  select symbol, max(ts) from stock_ticks_mor_ro group by symbol HAVING symbol = 'GOOG';
 symbol |        _col1
--------+---------------------
 GOOG   | 2018-08-31 10:29:00
(1 row)

Query 20231024_053221_00005_ch5qd, FINISHED, 1 node
Splits: 25 total, 25 done (100.00%)
0.51 [197 rows, 442KB] [387 rows/s, 869KB/s]

trino:default> select "_hoodie_commit_time", symbol, ts, volume, open, close  from stock_ticks_mor_ro where  symbol = 'GOOG';
 _hoodie_commit_time | symbol |         ts          | volume |   open    |  close
---------------------+--------+---------------------+--------+-----------+----------
 20231024051414103   | GOOG   | 2018-08-31 09:59:00 |   6330 |    1230.5 |  1230.02
 20231024051414103   | GOOG   | 2018-08-31 10:29:00 |   3391 | 1230.1899 | 1230.085
(2 rows)

Query 20231024_053239_00006_ch5qd, FINISHED, 1 node
Splits: 9 total, 9 done (100.00%)
0.37 [197 rows, 450KB] [530 rows/s, 1.18MB/s]

trino:default>
```

Hudi brings core data warehouse and database functionality directly to the data lake. It provides tables, transactions, efficient upserts/deletes, advanced indexing, streaming ingestion services, data clustering, compaction optimization, and concurrency control, while keeping the data in open file formats; that is, Hudi table data can live on file systems such as HDFS or Amazon S3.
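For instance, an upsert is just a DataFrame write with the upsert operation. A minimal sketch from spark-shell, assuming the Hudi Spark bundle is on the classpath; the table name, path, and rows below are made up for illustration:

```scala
// Run inside spark-shell with the Hudi bundle loaded; `spark` is the active session.
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Illustrative rows: `symbol` is the record key, and on duplicate keys the
// row with the larger precombine field (`ts`) wins.
val updates = Seq(("GOOG", "2018-08-31 10:59:00", 1234L)).toDF("symbol", "ts", "volume")

updates.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert"). // also: insert, bulk_insert, delete
  option("hoodie.datasource.write.recordkey.field", "symbol").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "stock_ticks_sketch").
  mode(SaveMode.Append).
  save("file:///tmp/hudi/stock_ticks_sketch")
```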

Part of why Hudi caught on so quickly and won over so many developers is that it is easy to use on any cloud platform, and its data can be accessed from any popular query engine (including Apache Spark, Flink, Presto, Trino, Hive, and others). Even more valuable is that Hudi's designers accounted for as many business scenarios and practical requirements as possible.

Starting from real business scenarios, the demands on a data lake platform fall into two broad categories, read-heavy and write-heavy, so Apache Hudi provides two table types (a minimal write sketch follows the list):

  • Copy On Write (COW) tables store data in a columnar file format (e.g. Parquet). Each write copies and rewrites the affected Parquet files, which suits read-heavy workloads.
  • Merge On Read (MOR) tables store data in a mix of columnar files (e.g. Parquet) and row-based files (e.g. Avro). Updates are written to row-based files and later compacted, synchronously or asynchronously, into new columnar files, which suits write-heavy workloads.
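Which of the two you get is controlled by a single write option. A minimal sketch with made-up names, writing the same rows once as COW and once as MOR:

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(("GOOG", "2018-08-31 09:59:00", 6330L)).toDF("symbol", "ts", "volume")

// COPY_ON_WRITE is the default; MERGE_ON_READ additionally writes row-based log files.
for (tableType <- Seq("COPY_ON_WRITE", "MERGE_ON_READ")) {
  df.write.format("hudi").
    option("hoodie.datasource.write.table.type", tableType).
    option("hoodie.datasource.write.recordkey.field", "symbol").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.table.name", s"ticks_${tableType.toLowerCase}").
    mode(SaveMode.Overwrite).
    save(s"file:///tmp/hudi/ticks_${tableType.toLowerCase}")
}
```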

Drilling down further, Hudi offers different query types over the two table types:

  • Snapshot Queries: query the latest snapshot of the table, i.e. all of the data
  • Incremental Queries: query only the data added or modified within a specified time range
  • Read Optimized Queries: for MOR tables, query only the data already in the Parquet files

Of these three, read optimized queries only apply to MOR tables (on a COW table they are pointless, since COW stores data exclusively in Parquet files anyway); the other two can be used on both COW and MOR tables.
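Through the Spark datasource, the query type is likewise one read option. A minimal sketch using the option keys as named in Hudi 0.13; the base path and instant time below are illustrative:

```scala
val basePath = "file:///tmp/hudi/ticks_merge_on_read"

// Snapshot query: the default; the latest merged view of the table.
val snapshot = spark.read.format("hudi").load(basePath)

// Incremental query: only records committed after the given instant time.
val incremental = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20231024051251213").
  load(basePath)

// Read optimized query (MOR only): reads just the columnar base files,
// skipping any log files that have not been compacted yet.
val readOptimized = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load(basePath)
```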

Upgrading the Docker setup to Hadoop 3

When running hive-sync, it fails with a Guava version mismatch. After much troubleshooting, the root cause turned out to be a Guava conflict introduced by hive-exec.

See the follow-up post: resolving the Guava conflict in the hive-exec package.