TaxiTour Multiple Window Test
Introduction
This test scenario uses the public TaxiTour dataset, which has been preprocessed into Parquet files. The test scripts are available at https://github.com/4paradigm/SparkFE/tree/main/benchmark/taxi_tour_multiple_window .
Test SQL
The test SQL contains two windows, each with a default size of 10000 rows. Larger windows increase the overall runtime, but they also make SparkFE's performance improvement more pronounced.
select
trip_duration,
passenger_count,
sum(pickup_latitude) over w as vendor_sum_pl,
max(pickup_latitude) over w as vendor_max_pl,
min(pickup_latitude) over w as vendor_min_pl,
avg(pickup_latitude) over w as vendor_avg_pl,
sum(pickup_latitude) over w2 as pc_sum_pl,
max(pickup_latitude) over w2 as pc_max_pl,
min(pickup_latitude) over w2 as pc_min_pl,
avg(pickup_latitude) over w2 as pc_avg_pl,
sum(dropoff_latitude) over w as vendor_sum_pl2,
max(dropoff_latitude) over w as vendor_max_pl2,
min(dropoff_latitude) over w as vendor_min_pl2,
avg(dropoff_latitude) over w as vendor_avg_pl2,
sum(dropoff_latitude) over w2 as pc_sum_pl2,
max(dropoff_latitude) over w2 as pc_max_pl2,
min(dropoff_latitude) over w2 as pc_min_pl2,
avg(dropoff_latitude) over w2 as pc_avg_pl2,
sum(trip_duration) over w as vendor_sum_pl3,
max(trip_duration) over w as vendor_max_pl3,
min(trip_duration) over w as vendor_min_pl3,
avg(trip_duration) over w as vendor_avg_pl3,
sum(trip_duration) over w2 as pc_sum_pl3,
max(trip_duration) over w2 as pc_max_pl3,
min(trip_duration) over w2 as pc_min_pl3,
avg(trip_duration) over w2 as pc_avg_pl3
from t1
window w as (partition by vendor_id order by pickup_datetime ROWS BETWEEN 10000 PRECEDING AND CURRENT ROW),
w2 as (partition by passenger_count order by pickup_datetime ROWS BETWEEN 10000 PRECEDING AND CURRENT ROW)
Test Script
The test can be run with a simple PySpark script (it can also be implemented in Scala or another language), and it runs locally on small data volumes. An example program is shown below.
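The following is a minimal sketch of such a PySpark driver. The input Parquet path, the SQL file name, and the output path are placeholders for illustration (they are not taken from the repository linked above); the table name t1 matches the SQL shown above. For reproducible measurements, use the benchmark scripts in the repository.

# run_taxi_tour_multiple_window.py -- minimal illustrative PySpark driver.
# Assumed placeholders: taxi_tour.parquet, taxi_tour_window.sql, taxi_tour_features.parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("TaxiTourMultipleWindow") \
    .getOrCreate()

# Load the preprocessed Parquet data and expose it as the table `t1`
# referenced by the test SQL.
df = spark.read.parquet("taxi_tour.parquet")
df.createOrReplaceTempView("t1")

# The multiple-window SQL shown above, stored in a separate file.
with open("taxi_tour_window.sql") as f:
    sql_text = f.read()

result = spark.sql(sql_text)

# Trigger execution and write the computed window features out as Parquet.
result.write.mode("overwrite").parquet("taxi_tour_features.parquet")

spark.stop()

Since SparkFE is packaged as a drop-in replacement for the Spark distribution, the same script can typically be submitted unchanged to both vanilla Spark and SparkFE; only the Spark distribution used to launch it changes.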