abel_cjy / Amazon-Selection-Data
Commit 1c7acf03, authored May 18, 2026 by chenyuanjie
fix
parent 67cc3704
Showing 1 changed file with 9 additions and 7 deletions
Pyspark_job/dim/dim_asin_profit_rate_info.py (+9 -7)
@@ -8,8 +8,8 @@ description: incremental sync + dedup consolidation of profit-rate data, one-stop PG → Hiv
     2) Spark reads today's sqoop increment plus all historical Hive partitions and deduplicates by (asin, price)
        Sort key: updated_time desc, keeping the latest row
     3) Overwrite today's partition with the consolidated full snapshot; once the write is validated, drop all historical partitions < today
-    4) Write the consolidated data for today to Doris dwd.dwd_asin_profit_rate_latest
-       Doris UNIQUE KEY(site_name, asin, price) + sequence_col=asin_crawl_date automatically keeps the latest
+    4) Write today's sqoop increment (no historical backfill) to Doris dwd.dwd_asin_profit_rate_latest
+       Doris UNIQUE KEY(site_name, asin, price) + sequence_col=update_time automatically keeps the latest
 Usage example: spark-submit dim_asin_profit_rate_info.py us 2026-05-15
 """
 import os
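Step 2 of the docstring (dedupe by (asin, price), keep the row with the latest updated_time) can be sketched in plain Python as below. This is only a model of the rule, not the job's code: the helper name `dedupe_latest` and the sample rows are hypothetical. In Spark the same rule is commonly written with `row_number()` over a window partitioned by the key and ordered by `updated_time` descending; this diff does not show whether the job does it that way.

```python
from datetime import datetime


def dedupe_latest(rows):
    """Keep one row per (asin, price): the one with the greatest updated_time."""
    latest = {}
    for row in rows:
        key = (row["asin"], row["price"])
        kept = latest.get(key)
        if kept is None or row["updated_time"] > kept["updated_time"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data: one historical row, two rows from today's increment.
history = [
    {"asin": "B000000001", "price": "19.99", "ocean_profit": 0.21,
     "updated_time": datetime(2026, 5, 10, 8, 0)},
]
today = [
    {"asin": "B000000001", "price": "19.99", "ocean_profit": 0.25,
     "updated_time": datetime(2026, 5, 15, 8, 0)},
    {"asin": "B000000002", "price": "5.00", "ocean_profit": 0.10,
     "updated_time": datetime(2026, 5, 15, 9, 0)},
]

# The consolidated snapshot holds one row per key, taken from the newest source.
snapshot = dedupe_latest(history + today)
```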
@@ -87,7 +87,8 @@ class DimAsinProfitRateInfo(object):
             WHERE site_name = '{self.site_name}' AND date_info = '{self.date_info}'
         """
         print(f"sql_today = \n{sql_today}")
-        self.df_today = self.spark.sql(sqlQuery=sql_today).repartition(40, 'asin', 'price')
+        # cache: save_data will DROP PARTITION on today's partition and write_to_doris reuses this df, so it must be materialized first
+        self.df_today = self.spark.sql(sqlQuery=sql_today).repartition(40, 'asin', 'price').cache()
         sql_history = f"""
             SELECT asin, price, category, ocean_profit, air_profit,
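The hazard that the new comment on df_today describes can be modeled without Spark. The sketch below uses a hypothetical `LazyFrame` class (not a Spark API) whose rows are read from the source only when evaluated, so dropping the partition first loses them. In Spark itself `cache()` only marks the DataFrame for caching and the data is stored when the first action runs, so the fix presumably relies on some action over df_today running before save_data drops the partition; that ordering is not visible in this diff.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, source, partition):
        self._source = source        # dict: partition name -> list of rows
        self._partition = partition
        self._cached = None

    def collect(self):
        # Without a materialized copy, every evaluation re-reads the source.
        if self._cached is not None:
            return list(self._cached)
        return list(self._source.get(self._partition, []))

    def materialize(self):
        # Models cache() followed by an action: snapshot the rows now.
        self._cached = list(self._source.get(self._partition, []))
        return self


table = {"2026-05-15": [("B000000001", "19.99")]}

lazy = LazyFrame(table, "2026-05-15")
pinned = LazyFrame(table, "2026-05-15").materialize()

del table["2026-05-15"]   # models DROP PARTITION on today's partition

lost = lazy.collect()     # re-reads the dropped partition
kept = pinned.collect()   # served from the materialized copy
```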
@@ -181,10 +182,10 @@ class DimAsinProfitRateInfo(object):
         self.df_history.unpersist()
     def write_to_doris(self):
-        """Write the consolidated data for today to Doris dwd_asin_profit_rate_latest
-        Doris UNIQUE KEY(site_name, asin, price) + sequence_col=asin_crawl_date automatically keeps the latest row by crawl time
+        """Write today's sqoop increment to Doris dwd_asin_profit_rate_latest (no historical backfill)
+        Doris UNIQUE KEY(site_name, asin, price) + sequence_col=update_time automatically keeps the latest row by update time
         """
-        df_to_doris = self.df_save.select(
+        df_to_doris = self.df_today.select(
             F.lit(self.site_name).alias('site_name'),
             F.col('asin'),
             F.round(F.col('price'), 2).cast('decimal(20,2)').alias('price'),
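The write_to_doris docstring relies on the Doris table resolving conflicts itself: rows sharing the UNIQUE KEY (site_name, asin, price) are reduced to the one with the largest sequence value, here update_time. A minimal plain-Python model of that rule follows. The `upsert` helper and the sample rows are hypothetical, and letting the incoming row win a tie is an assumption of this sketch, not a statement about Doris.

```python
from datetime import datetime


def upsert(table, row, key_cols=("site_name", "asin", "price"), seq_col="update_time"):
    """Toy model of a unique-key table with a sequence column.

    An incoming row replaces the stored row for the same key only if its
    sequence value is not smaller (tie handling is an assumption here).
    """
    key = tuple(row[c] for c in key_cols)
    current = table.get(key)
    if current is None or row[seq_col] >= current[seq_col]:
        table[key] = row


# Hypothetical rows for the same key, arriving out of order.
newer = {"site_name": "us", "asin": "B000000001", "price": "19.99",
         "ocean_profit": 0.25, "update_time": datetime(2026, 5, 15, 8, 0)}
older = {"site_name": "us", "asin": "B000000001", "price": "19.99",
         "ocean_profit": 0.21, "update_time": datetime(2026, 5, 10, 8, 0)}

table = {}
upsert(table, newer)
upsert(table, older)   # arrives late, but carries a smaller update_time
```

Under this rule the load order does not matter, which is why writing only the daily increment is enough to keep the latest row per key.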
@@ -194,7 +195,7 @@ class DimAsinProfitRateInfo(object):
             F.to_timestamp(F.col('updated_time')).alias('update_time'),
         ).cache()
         count = df_to_doris.count()
-        print(f"Rows written to Doris: {count:,}")
+        print(f"Incremental rows written to Doris: {count:,}")
         df_to_doris.show(10, truncate=False)
         table_columns = "site_name, asin, price, ocean_profit, air_profit, asin_crawl_date, update_time"
@@ -205,6 +206,7 @@ class DimAsinProfitRateInfo(object):
             table_columns=table_columns,
         )
         df_to_doris.unpersist()
+        self.df_today.unpersist()
         self.df_save.unpersist()
         print("success!")