starrocks的存储分离教程

在上一章的部署环境之后,使用了程序。

1
2
3
4
5
6
7
#下载交通事故数据集

curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv

# 下载天气数据

curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv

再dbeaver上创建

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
CREATE STORAGE VOLUME shared

TYPE = S3

LOCATIONS = ("s3://starrocks/shared/")

PROPERTIES

(

"enabled" = "true",

"aws.s3.endpoint" = "http://10.7.10.190:9000",

"aws.s3.use_aws_sdk_default_behavior" = "false",

"aws.s3.enable_ssl" = "false",

"aws.s3.use_instance_profile" = "false",

"aws.s3.access_key" = "AAAAAAAAAAAAAAA",

"aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBB"

);

SET shared AS DEFAULT STORAGE VOLUME;

查看一下storage VOLUME

{4273D6DF-85DA-4479-A477-944AF982CC0B}.png

接下来建表

创建数据库

执行以下语句创建数据库 quickstart,并切换到该数据库下。

1
CREATE DATABASE IF NOT EXISTS quickstart;USE quickstart;

建表

crashdata 表

创建 crashdata 表,用于存储交通事故数据集中的数据。该表的字段经过裁剪,仅包含与该教程相关字段。

1
CREATE TABLE IF NOT EXISTS crashdata (    CRASH_DATE DATETIME,    BOROUGH STRING,    ZIP_CODE STRING,    LATITUDE INT,    LONGITUDE INT,    LOCATION STRING,    ON_STREET_NAME STRING,    CROSS_STREET_NAME STRING,    OFF_STREET_NAME STRING,    CONTRIBUTING_FACTOR_VEHICLE_1 STRING,    CONTRIBUTING_FACTOR_VEHICLE_2 STRING,    COLLISION_ID INT,    VEHICLE_TYPE_CODE_1 STRING,    VEHICLE_TYPE_CODE_2 STRING);

weatherdata 表

创建 weatherdata 表,用于存储天气数据集中的数据。该表的字段同样经过裁剪,仅包含与该教程相关字段。

1
CREATE TABLE IF NOT EXISTS weatherdata (    DATE DATETIME,    NAME STRING,    HourlyDewPointTemperature STRING,    HourlyDryBulbTemperature STRING,    HourlyPrecipitation STRING,    HourlyPresentWeatherType STRING,    HourlyPressureChange STRING,    HourlyPressureTendency STRING,    HourlyRelativeHumidity STRING,    HourlySkyConditions STRING,    HourlyVisibility STRING,    HourlyWetBulbTemperature STRING,    HourlyWindDirection STRING,    HourlyWindGustSpeed STRING,    HourlyWindSpeed STRING);

导入数据集

StarRocks 提供多种数据导入方法。本教程使用 curl 命令发起 Stream Load 任务导入数据。

Stream Load 使用的 curl 命令虽然看起来复杂,但本教程在最后部分提供了详细解释。建议您先运行该命令导入数据,然后在教程结束后了解有关数据导入的详细内容。

导入纽约市交通事故数据

通过 FE 容器的 Shell Session 进入数据集文件所在的路径,然后运行以下命令。

1
curl --location-trusted -u root             \    -T ./NYPD_Crash_Data.csv                \    -H "label:crashdata-0"                  \    -H "column_separator:,"                 \    -H "skip_header:1"                      \    -H "enclose:\""                         \    -H "max_filter_ratio:1"                 \    -H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \    -XPUT http://localhost:8030/api/quickstart/crashdata/_stream_load

运行命令后,系统会提示您输入密码。由于您尚未为 root 用户分配密码,所以只需按 Enter 键跳过。

以下是上述命令的返回。其中 Status 为 Success,表示导入成功。NumberFilteredRows 为 1,表示数据集中包含一行错误数据,StarRocks 在导入过程将该行过滤。

1
Enter host password for user 'root':{    "TxnId": 2,    "Label": "crashdata-0",    "Status": "Success",    "Message": "OK",    "NumberTotalRows": 423726,    "NumberLoadedRows": 423725,    "NumberFilteredRows": 1,    "NumberUnselectedRows": 0,    "LoadBytes": 96227746,    "LoadTimeMs": 1013,    "BeginTxnTimeMs": 21,    "StreamLoadPlanTimeMs": 63,    "ReadDataTimeMs": 563,    "WriteDataTimeMs": 870,    "CommitAndPublishTimeMs": 57,    "ErrorURL": "http://10.5.0.3:8040/api/_load_error_log?file=error_log_da41dd88276a7bfc_739087c94262ae9f"}%

当 Stream Load 发生错误时,StarRocks 会返回一个包含错误信息的 URL。由于容器具有私有 IP 地址,您必须通过容器中的 Shell Session 运行 curl 命令查看。

1
curl http://10.5.0.3:8040/api/_load_error_log?file=error_log_da41dd88276a7bfc_739087c94262ae9f

点击下方 错误信息 查看 URL 中包含的错误信息。

错误信息

导入天气数据

依照同样的方式导入天气数据。

1
curl --location-trusted -u root             \    -T ./72505394728.csv                    \    -H "label:weather-0"                    \    -H "column_separator:,"                 \    -H "skip_header:1"                      \    -H "enclose:\""                         \    -H "max_filter_ratio:1"                 \    -H "columns: STATION, DATE, LATITUDE, LONGITUDE, ELEVATION, NAME, REPORT_TYPE, SOURCE, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySkyConditions, HourlySeaLevelPressure, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, Sunrise, Sunset, DailyAverageDewPointTemperature, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageSeaLevelPressure, DailyAverageStationPressure, DailyAverageWetBulbTemperature, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, DailySustainedWindSpeed, DailyWeather, MonthlyAverageRH, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyDewpointTemperature, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyGreatestSnowDepth, MonthlyGreatestSnowDepthDate, MonthlyGreatestSnowfall, MonthlyGreatestSnowfallDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, MonthlyTotalSnowfall, MonthlyWetBulb, AWND, CDSD, CLDD, DSNW, HDSD, HTDD, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, REM, BackupDirection, BackupDistance, BackupDistanceUnit, BackupElements, BackupElevation, BackupEquipment, BackupLatitude, BackupLongitude, BackupName, WindEquipmentChangeDate" \    -XPUT http://localhost:8030/api/quickstart/weatherdata/_stream_load

执行效果如下所示 {C5A7921B-F2AD-4612-91FA-569953263F3B}.png


验证数据是否存储在 MinIO 中

打开 MinIO Web UI 并验证在 starrocks/shared/ 下的每个路径中是否有 datametadata 和 schema 路径。

提示

在导入数据时,starrocks/shared/ 下的文件夹名称是动态生成的。您会在 shared 下面看到一个路径,然后在该路径下看到另外两个路径。在每个路径内,您都能看到 datametadata 和 schema 路径。 {64FBD934-469C-4214-833F-3B984FB85CBB}.png

查询数据并回答问题

查询一:纽约市每小时交通事故数量

1
SELECT COUNT(*),       date_trunc("hour", crashdata.CRASH_DATE) AS TimeFROM crashdataGROUP BY TimeORDER BY Time ASCLIMIT 200;

{B35948E1-5D0D-4EEA-8BC8-7DD714284DFB}.png

查询二:纽约市的平均气温

1
SELECT avg(HourlyDryBulbTemperature),       date_trunc("hour", weatherdata.DATE) AS TimeFROM weatherdataGROUP BY TimeORDER BY Time ASCLIMIT 100;

{DA925D9B-B468-4267-89E0-7C2C3CE95A3D}.png

查询三:能见度情况对驾驶安全的影响

为了解能见度情况对驾驶安全的影响,需要对两张表格的 DATETIME 列进行 JOIN,分析在能见度不佳的情况下(0 到 1.0 英里之间)时的交通事故数量。

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13

SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,  
truncate(avg(w.HourlyDryBulbTemperature), 1) AS Temp_F,  
truncate(avg(w.HourlyVisibility), 2) AS Visibility,  
max(w.HourlyPrecipitation) AS Precipitation,  
date_format((date_trunc("hour", c.CRASH_DATE)), '%d %b %Y %H:%i') AS Hour  
FROM crashdata c  
LEFT JOIN weatherdata w  
ON date_trunc("hour", c.CRASH_DATE)=date_trunc("hour", w.DATE)  
WHERE w.HourlyVisibility BETWEEN 0.0 AND 1.0  
GROUP BY Hour  
ORDER BY Crashes DESC  
LIMIT 100;

{203E286C-D0B9-4E1E-A992-E91145CC176B}.png 从以上结果可以得出,在能见度较低的一小时内的最高交通事故数量是 129。

当然,除了能见度因素外,还有其他因素需要考虑在内:

  • 2014 年 2 月 3 日是星期一(工作日)
  • 上午 8 点是早高峰时段
  • 当时正在下雨(一小时内降水量为 0.12 英寸)
  • 温度为 32 华氏度(水的冰点)

查询四:结冰情况对驾驶安全的影响

由于路面上水大约会在 40 华氏度时开始转变为冰,因此以下查询分析了温度区间为 0 到 40 华氏度的交通事故数量。

1
SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,       truncate(avg(w.HourlyDryBulbTemperature), 1) AS Temp_F,       truncate(avg(w.HourlyVisibility), 2) AS Visibility,       max(w.HourlyPrecipitation) AS Precipitation,       date_format((date_trunc("hour", c.CRASH_DATE)), '%d %b %Y %H:%i') AS HourFROM crashdata cLEFT JOIN weatherdata wON date_trunc("hour", c.CRASH_DATE)=date_trunc("hour", w.DATE)WHERE w.HourlyDryBulbTemperature BETWEEN 0.0 AND 40.5 GROUP BY HourORDER BY Crashes DESCLIMIT 100;

{5181B7A6-94F0-4662-A284-7BECF430EC9B}.png

结果显示,2015 年 1 月 18 日发生了大量交通事故。虽然当天是星期天早晨,但根据 [weather.com] 显示,当天下了一场大雪,导致许多交通事故。

注意这里导入的时候,要放在local本地进行导入,再远程导入时报错 如下所示: {EC633690-4AB4-4091-97BF-D25B9A9E41F9}.png

Licensed under CC BY-NC-SA 4.0
最后更新于 Apr 11, 2025 07:19 UTC
comments powered by Disqus
Built with Hugo
主题 StackJimmy 设计
Caret Up