Querying Data on JuiceFS with Spark SQL

*1.* Introduction

The traditional coupled storage-compute architecture has an inherent performance advantage from data locality. Once storage and compute are separated, all reads and writes go over the network, so that locality advantage is lost. To compensate, a distributed cache layer such as JuiceFS or Alluxio is introduced between compute and storage to further improve read and write performance.

JuiceFS and Alluxio are two of the most widely used distributed caching components today.

*2.* Runtime Architecture

Batch jobs with relatively modest data-access performance requirements, or whose data must be stored durably, read and write the underlying storage directly. For example, the Spark SQL batch jobs for the ODS, DWD, and DIM layers of the data warehouse go straight to the underlying storage, while ADS-layer jobs and other workloads with tighter read/write latency requirements access the underlying storage through the distributed cache.

Why split them instead of routing everything through the distributed cache?

Because of how JuiceFS is architected, data written through JuiceFS can only be read back through JuiceFS. If the JuiceFS service went down, every compute job would stop working, and that risk is simply too high.

(Architecture diagram)

Testing Spark SQL on JuiceFS

1. Start a Hive Metastore

tar -zvxf apache-hive-3.1.3.tgz
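If the Hive tarball is not on the machine yet, one way to lay it out so that the hive/conf, hive/lib, and hive/bin paths used below resolve (the download URL and the symlink are assumptions, adjust to your environment):

# assumed download source; the binary tarball from any Apache mirror works
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -zxvf apache-hive-3.1.3-bin.tar.gz
# symlink so the hive/conf, hive/lib, hive/bin paths below resolve
ln -s apache-hive-3.1.3-bin hive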

Configuration file hive-site.xml:

❯ cat hive/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>

<configuration>

    <!-- ** Security -->
    <property>
        <name>hive.security.authorization.enabled</name>
        <value>false</value>
    </property>

    <property>
        <name>hive.security.authenticator.manager</name>
        <value>org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator</value>
    </property>

    <property>
        <name>hive.security.authorization.manager</name>
        <value>org.apache.hadoop.hive.ql.security.authorization.plugin.sqlstd.SQLStdConfOnlyAuthorizerFactory</value>
    </property>

    <property>
        <name>hive.privilege.synchronizer</name>
        <value>false</value>
    </property>

    <property>
        <name>hive.security.metastore.authenticator.manager</name>
        <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
    </property>

    <property>
        <name>hive.security.metastore.authorization.auth.reads</name>
        <value>true</value>
    </property>

    <property>
        <name>hive.security.metastore.authorization.manager</name>
        <value>org.apache.hadoop.hive.ql.security.authorization.DefaultHiveMetastoreAuthorizationProvider</value>
    </property>

    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>

    <!-- jdbc 连接的 URL -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://10.7.3.100:30006/hive_hdfs_juicefs?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <!-- jdbc 连接的 Driver-->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- jdbc 连接的 username-->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- jdbc 连接的 password -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
    <!-- Hive 元数据存储版本的验证 -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>true</value>
    </property>

    <!-- Metastore -->
    <property>
        <name>hive.metastore.sasl.enabled</name>
        <value>false</value>
    </property>

    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://localhost:9083</value>
    </property>

    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>jfs://juicefsminio/orders_hudi_2</value>
    </property>

    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>

    <property>
        <name>metastore.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>

    <!-- HiveServer2 -->
    <property>
        <name>hive.users.in.admin.role</name>
        <value>root,hive,hadoop</value>
    </property>

    <property>
        <name>hive.server2.authentication</name>
        <value>NONE</value>
    </property>

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop-test01</value>
    </property>

    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>

</configuration>
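Before initializing the schema later on, it can be worth confirming that the MySQL instance behind javax.jdo.option.ConnectionURL is reachable (a quick sanity check, assuming the mysql client is installed; credentials match the config above):

# the hive_hdfs_juicefs database is created on first use thanks to createDatabaseIfNotExist=true
mysql -h 10.7.3.100 -P 30006 -uroot -proot -e "SELECT VERSION();"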

core-site.xml

 cat hive/conf/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>fs.s3a.endpoint</name>
        <value>http://10.7.3.100:31889</value>
        <description>AWS S3 endpoint to connect to. An up-to-date list is
            provided in the AWS Documentation: regions and endpoints. Without this
            property, the standard region (s3.amazonaws.com) is assumed.
        </description>
    </property>

    <property>
        <name>fs.s3a.access.key</name>
        <value>MpCurbSf0FdWTbO9XCzB</value>
    </property>

    <property>
        <name>fs.s3a.secret.key</name>
        <value>d6VSjJAAyHqszi3RFif4Yg1UEBd9T4nL6hLcyK71</value>
    </property>

    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
        <description>Enable S3 path style access ie disabling the default virtual hosting behaviour.
            Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
        </description>
    </property>

    <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <value>
            org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
        </value>
        <description>
            Comma-separated class names of credential provider classes which implement
            com.amazonaws.auth.AWSCredentialsProvider.

            When S3A delegation tokens are not enabled, this list will be used
            to directly authenticate with S3 and DynamoDB services.
            When S3A Delegation tokens are enabled, depending upon the delegation
            token binding it may be used
            to communicate with the STS endpoint to request session/role
            credentials.

            These are loaded and queried in sequence for a valid set of credentials.
            Each listed class must implement one of the following means of
            construction, which are attempted in order:
            * a public constructor accepting java.net.URI and
            org.apache.hadoop.conf.Configuration,
            * a public constructor accepting org.apache.hadoop.conf.Configuration,
            * a public static method named getInstance that accepts no
            arguments and returns an instance of
            com.amazonaws.auth.AWSCredentialsProvider, or
            * a public default constructor.

            Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
            anonymous access to a publicly accessible S3 bucket without any credentials.
            Please note that allowing anonymous access to an S3 bucket compromises
            security and therefore is unsuitable for most use cases. It can be useful
            for accessing public data sets without requiring AWS credentials.

            If unspecified, then the default list of credential provider classes,
            queried in sequence, is:
            * org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider: looks
            for session login secrets in the Hadoop configuration.
            * org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
            Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
            * com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
            configuration of AWS access key ID and secret access key in
            environment variables named AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
            and AWS_SESSION_TOKEN as documented in the AWS SDK.
            * org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider: picks up
            IAM credentials of any EC2 VM or AWS container in which the process is running.
        </description>
    </property>

    <property>
        <name>fs.defaultFS</name>
        <value>jfs://juicefsminio/hudi-dir</value>
        <description>Optional, you can also specify full path "jfs://myjfs/path-to-dir" with location to use JuiceFS</description>
    </property>
    <property>
        <name>fs.jfs.impl</name>
        <value>io.juicefs.JuiceFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.jfs.impl</name>
        <value>io.juicefs.JuiceFS</value>
    </property>
    <property>
        <name>juicefs.meta</name>
        <value>mysql://root:root@(10.7.3.100:30006)/juicefs</value>
    </property>
    <property>
        <name>juicefs.cache-dir</name>
        <value>/tmp/juicefs-cache-dir</value>
    </property>
    <property>
        <name>juicefs.cache-size</name>
        <value>1024</value>
    </property>
    <property>
        <name>juicefs.access-log</name>
        <value>/tmp/juicefs.access.log</value>
    </property>
    <property>
      <name>juicefs.superuser</name>
      <value>flink</value>
    </property>
</configuration>
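juicefs.meta points the Hadoop SDK at the JuiceFS metadata engine, and the jfs://juicefsminio/... paths only work if that volume exists. A quick sanity check with the juicefs client binary (assuming it is installed on the node):

# quotes are needed because of the parentheses in the metadata URL
juicefs status "mysql://root:root@(10.7.3.100:30006)/juicefs"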

Upload the required jar:

scp -r juicefs-hadoop-1.1.0-dev.jar hive/lib/

Start the Hive Metastore:

# 1. The first startup requires initializing the metastore schema
hive/bin/schematool -initSchema -dbType mysql -verbose
# Start the metastore in the background
nohup hive/bin/hive --service metastore 2>&1 &
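To confirm the metastore actually came up, check the Thrift port configured in hive.metastore.uris and the nohup output (optional checks, not part of the original steps):

# 9083 is the port from hive.metastore.uris
ss -lntp | grep 9083
tail -n 100 nohup.out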

2. Spark Test

spark-3.5.1-bin-hadoop3/
ln -s spark-3.5.1-bin-hadoop3 spark

Next, copy Hive's core-site.xml and hive-site.xml into the spark/conf directory.

Copy juicefs-hadoop-1.1.0-dev.jar into spark/jars.
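Assuming Hive and Spark are unpacked side by side as above, both copies boil down to:

cp hive/conf/hive-site.xml hive/conf/core-site.xml spark/conf/
cp hive/lib/juicefs-hadoop-1.1.0-dev.jar spark/jars/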

Run the test with spark-sql:

 spark/bin/spark-sql \
 --verbose \
 --database default \
 --name sql_test_1 \
 --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/driver-sql-hadoop-jfs-1.log" \
 --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/executor-sql-hadoop-jfs-1.log" \
 -e \
 "
 DROP TABLE IF EXISTS spark_dept;
 CREATE TABLE spark_dept(deptno int, dname string, loc string);
 INSERT INTO spark_dept VALUES (10, 'ACCOUNTING', 'NEW YORK');
 select * from spark_dept;
 select count(*) from spark_dept where deptno=10
 "
Maximum heap size rounded up to minimum supported size 512 MB, specified Xmx is 128 MB.
Using properties file: null
24/07/15 14:24:33 WARN Utils: Your hostname, xfhuang-pc resolves to a loopback address: 127.0.1.1; using 198.18.0.1 instead (on interface eth0)
24/07/15 14:24:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Parsed arguments:
  master                  local[*]
  remote                  null
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  -Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/driver-sql-hadoop-jfs-1.log
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
  primaryResource         spark-internal
  name                    sql_test_1
  childArgs               [--database default -e
 DROP TABLE IF EXISTS spark_dept;
 CREATE TABLE spark_dept(deptno int, dname string, loc string);
 INSERT INTO spark_dept VALUES (10, 'ACCOUNTING', 'NEW YORK');
 select * from spark_dept;
 select count(*) from spark_dept where deptno=10
 ]
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file null:
  (spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/driver-sql-hadoop-jfs-1.log)
  (spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/executor-sql-hadoop-jfs-1.log)


Main class:
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
Arguments:
--database
default
-e

 DROP TABLE IF EXISTS spark_dept;
 CREATE TABLE spark_dept(deptno int, dname string, loc string);
 INSERT INTO spark_dept VALUES (10, 'ACCOUNTING', 'NEW YORK');
 select * from spark_dept;
 select count(*) from spark_dept where deptno=10

--verbose
Spark config:
(spark.app.name,sql_test_1)
(spark.app.submitTime,1721024673994)
(spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/driver-sql-hadoop-jfs-1.log)
(spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true -Dlog.file=/opt/spark/logs/executor-sql-hadoop-jfs-1.log)
(spark.jars,)
(spark.master,local[*])
(spark.submit.deployMode,client)
(spark.submit.pyFiles,)
Classpath elements:



Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/15 14:24:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/15 14:24:34 WARN HiveConf: HiveConf of name hive.privilege.synchronizer does not exist
24/07/15 14:24:34 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
Zing VM warning: alt signal stk requested for user signal handler (signal 1).
Zing VM warning: alt signal stk requested for user signal handler (signal 2).
Zing VM warning: alt signal stk requested for user signal handler (signal 7).
Zing VM warning: alt signal stk requested for user signal handler (signal 8).
Zing VM warning: alt signal stk requested for user signal handler (signal 11).
Zing VM warning: alt signal stk requested for user signal handler (signal 12).
Zing VM warning: alt signal stk requested for user signal handler (signal 13).
Zing VM warning: alt signal stk requested for user signal handler (signal 15).
Zing VM warning: alt signal stk requested for user signal handler (signal 23).
Zing VM warning: alt signal stk requested for user signal handler (signal 60).
Zing VM warning: alt signal stk requested for user signal handler (signal 61).
Zing VM warning: alt signal stk requested for user signal handler (signal 62).
2024/07/15 14:24:35.013278 juicefs[3128621] <WARNING>: The latency to database is too high: 12.534501ms [sql.go:240]
24/07/15 14:24:35 WARN JuiceFileSystemImpl: 2024/07/15 14:24:35.013278 juicefs[3128621] <WARNING>: The latency to database is too high: 12.534501ms [sql.go:240]
Spark Web UI available at http://198.18.0.1:4040
Spark master: local[*], Application Id: local-1721024676263

 DROP TABLE IF EXISTS spark_dept
Time taken: 1.867 seconds

 CREATE TABLE spark_dept(deptno int, dname string, loc string)
24/07/15 14:24:41 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
24/07/15 14:24:41 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 0.859 seconds

 INSERT INTO spark_dept VALUES (10, 'ACCOUNTING', 'NEW YORK')
Time taken: 4.986 seconds

 select * from spark_dept
10      ACCOUNTING      NEW YORK
Time taken: 0.663 seconds, Fetched 1 row(s)

 select count(*) from spark_dept where deptno=10

24/07/15 14:24:48 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(GPGC Old Pauses), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
24/07/15 14:24:48 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(GPGC New Cycles, GPGC Old Pauses), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
24/07/15 14:24:48 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(GPGC New Pauses, GPGC New Cycles, GPGC Old Pauses), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
24/07/15 14:24:48 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(GPGC Old Cycles, GPGC New Pauses, GPGC New Cycles, GPGC Old Pauses), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
1
Time taken: 0.987 seconds, Fetched 1 row(s)

❯ hadoop/bin/hadoop fs -ls /orders_hudi_2
Zing VM warning: alt signal stk requested for user signal handler (signal 1).
Zing VM warning: alt signal stk requested for user signal handler (signal 2).
Zing VM warning: alt signal stk requested for user signal handler (signal 7).
Zing VM warning: alt signal stk requested for user signal handler (signal 8).
Zing VM warning: alt signal stk requested for user signal handler (signal 11).
Zing VM warning: alt signal stk requested for user signal handler (signal 12).
Zing VM warning: alt signal stk requested for user signal handler (signal 13).
Zing VM warning: alt signal stk requested for user signal handler (signal 15).
Zing VM warning: alt signal stk requested for user signal handler (signal 23).
Zing VM warning: alt signal stk requested for user signal handler (signal 60).
Zing VM warning: alt signal stk requested for user signal handler (signal 61).
Zing VM warning: alt signal stk requested for user signal handler (signal 62).
2024-07-15 14:24:56,784 INFO fs.TrashPolicyDefault: The configured checkpoint interval is 0 minutes. Using an interval of 0 minutes that is used for deletion instead
2024-07-15 14:24:56,785 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Found 1 items
drwxr-xr-x   - xfhuang supergroup       4096 2024-07-15 14:24 /orders_hudi_2/spark_dept
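Since core-site.xml sets juicefs.access-log to /tmp/juicefs.access.log, that log can also be tailed while the job runs to confirm that Spark's reads and writes really go through JuiceFS (an optional check):

tail -f /tmp/juicefs.access.log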

How does JuiceFS's user permission management compare with HDFS's?

Like HDFS, JuiceFS manages file permissions by user and group, and by default it uses the local users and groups of each node. To keep permissions consistent across nodes in a distributed computation, you can configure global user/UID and group/GID mappings via juicefs.users and juicefs.groups.
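A hypothetical sketch of that mapping in core-site.xml (the property names come from the JuiceFS Hadoop SDK, but the file paths here are made up for illustration; see the SDK documentation linked below for the exact file format):

<!-- hypothetical paths: shared mapping files stored on the JuiceFS volume itself -->
<property>
    <name>juicefs.users</name>
    <value>jfs://juicefsminio/etc/users</value>
</property>
<property>
    <name>juicefs.groups</name>
    <value>jfs://juicefsminio/etc/groups</value>
</property>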

If a permission-denied error is reported during execution, add the juicefs.superuser property to core-site.xml, as shown in the core-site.xml listing above.


References:

https://juicefs.com/docs/zh/community/hadoop_java_sdk/

http://www.fblinux.com/?p=3103
