How to get the most out of SQL is the topic of this article. It takes a distinctive approach, emphasizing that with flexible and divergent data processing thinking, complex data scenarios can be solved with the most basic syntax.
I. Introduction
1. Motivation
This article is about how to use MaxCompute (ODPS) SQL efficiently and get the most out of basic SQL syntax.
Now that big data is so widespread, it is not only professional data practitioners who deal with SQL frequently; even non-technical roles such as product managers and operations staff use SQL to some extent. Using SQL effectively, and thereby unlocking the power of data, has become especially important.
As a SQL dialect, MaxCompute (ODPS) SQL has complete SQL syntax support, rich built-in functions, and many advanced features such as window functions, user-defined functions, and user-defined types, so it can be applied efficiently in a wide range of data processing scenarios.
How to get the most out of SQL is the topic of this article. It tries to take a fresh approach, emphasizing that with flexible and divergent data processing thinking, complex data scenarios can be solved with the most basic syntax.
2. Target audience
This article may be helpful to beginners and veterans alike, but it is better suited to intermediate and advanced readers.
To keep the discussion focused, the functions and syntax features used in this article are not introduced in detail; readers can look them up on their own.
3. Content structure
This article covers topics such as sequence generation, interval transformation, permutations and combinations, and continuity analysis, and explains their practical application with examples. The topics depend slightly on one another, so it is best to read them in order.
4. Notes
The SQL statements in this article use only the basic syntax features of MaxCompute (ODPS) SQL, and in theory they can all run on the current latest version.
II. Sequences
Sequences are one of the most common forms of data, and those encountered in real data development scenarios are mostly finite. This section starts from the simplest increasing sequence, works out a general approach, and generalizes it to broader scenarios.
1. Common sequences
1) A simple increasing sequence
First, consider a simple increasing integer sequence that:
starts from the value 0;
increases each subsequent value by 1;
ends at the value 3.
How do we generate a sequence that satisfies the above three conditions, i.e. [0, 1, 2, 3]?
There are actually many ways to generate this sequence; here is a simple and versatile one.
sql - 1
select
    t.pos as a_n
from (
    select posexplode(split(space(3), space(1), false))
) t;
As the SQL snippet above shows, generating an increasing sequence takes only three steps:
Generate an array of appropriate length; the array elements need no actual meaning;
Generate an index subscript for each array element via the UDTF posexplode;
Take out the index subscript of each element.
These three steps generalize to broader sequence scenarios: arithmetic sequences and geometric sequences. Based on them, the final implementation templates are given directly below.
2) Arithmetic sequences
Let the first term be a and the common difference be d. Then the general term of the arithmetic sequence is a_n = a + (n - 1) * d.
SQL implementation:
sql - 2
select
    a + t.pos * d as a_n
from (
    select posexplode(split(space(n - 1), space(1), false))
) t;
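For instance, a hypothetical instantiation of this template with first term a = 1, common difference d = 2, and n = 4 terms:
select
    1 + t.pos * 2 as a_n -- 1, 3, 5, 7
from (
    select posexplode(split(space(4 - 1), space(1), false))
) t;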
3) Geometric sequences
Let the first term be a and the common ratio be q. Then the general term of the geometric sequence is a_n = a * q^(n - 1).
SQL implementation:
sql - 3
select
    a * pow(q, t.pos) as a_n
from (
    select posexplode(split(space(n - 1), space(1), false))
) t;
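Similarly, a hypothetical instantiation with first term a = 1, common ratio q = 2, and n = 4 terms:
select
    1 * pow(2, t.pos) as a_n -- 1, 2, 4, 8 (pow returns a double)
from (
    select posexplode(split(space(4 - 1), space(1), false))
) t;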
Tip: You can also directly use the MaxCompute (ODPS) system function sequence to quickly generate a sequence.
sql - 4
select sequence(1, 3, 1);
result
[1, 2, 3]
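Since sequence takes a step argument, it also covers arithmetic sequences directly; a quick sketch, assuming the same three-argument form as above:
select sequence(1, 10, 3); -- first term 1, common difference 3: [1, 4, 7, 10]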
2. Examples of application scenarios
1) Restoring the names of aggregated dimension columns under arbitrary dimension combinations
In multidimensional analysis scenarios, advanced aggregation functions such as cube, rollup, and grouping sets may be used to compute statistics under different combinations of dimensions.
Scenario description
In the existing user access log table, each row of data represents a user access log.
sql - 5
with visit_log as (
    select stack (
        6,
        '2024-01-01', '101', 'Hubei', 'Wuhan', 'android',
        '2024-01-01', '102', 'Hunan', 'Changsha', 'ios',
        '2024-01-01', '103', 'Sichuan', 'Chengdu', 'windows',
        '2024-01-02', '101', 'Hubei', 'Xiaogan', 'mac',
        '2024-01-02', '102', 'Hunan', 'Shaoyang', 'android',
        '2024-01-03', '101', 'Hubei', 'Wuhan', 'ios'
    ) -- Fields: date, user, province, city, device type.
    as (dt, user_id, province, city, device_type)
)
select * from visit_log;
Now, for the three dimension columns province, city, and device type, the number of user visits under different dimension combinations is computed via grouping sets. Questions:
How do we know which dimension columns a statistic was aggregated from?
If the names of the aggregated dimension columns need to be output, for example for downstream report display, how do we handle it?
Solution Ideas:
This can be solved with the Grouping ID provided by MaxCompute (ODPS); the core method is to reverse the computation of the grouping ID.
The detailed steps are as follows:
Prepare all grouping IDs.
Generate an increasing sequence [0, 1, 2, ..., 2^n - 1], convert each value to a binary string, and expand each bit of that string, where n is the number of dimension columns, 2^n is the number of all dimension combinations, and each value represents one grouping ID.
Prepare all dimension names.
Generate a string sequence that holds the dimension column names in order, i.e. { dim_name_1, dim_name_2, ..., dim_name_n }.
Map grouping IDs to dimension column names.
For each value in the increasing sequence of grouping IDs, map each bit of the value to a subscript of the dimension-name sequence, and output all dimension names whose bit is 0. For example:
grouping__id: 3 => binary '011'
Dimension name sequence: { province, city, device_type }
Mapping: the bit at subscript 0 is 0, so the aggregated dimension of the rows with grouping ID 3 is: province.
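As a minimal sketch of this bit expansion, conv converts the grouping ID to a binary string and lpad pads it to the number of dimension columns:
select lpad(conv(3, 10, 2), 3, '0'); -- '011': the bit at subscript 0 is '0', so the aggregated dimension is province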
SQL implementation:
sql - 6
with group_dimension as (
    select -- The dimension fields corresponding to each grouping.
        gb.group_id,
        concat_ws(",", collect_list(case when gb.placeholder_bit = 0 then dim_col.val else null end)) as dimension_name
    from (
        select groups.pos as group_id, pe.*
        from (
            select posexplode(split(space(cast(pow(2, 3) as int) - 1), space(1), false))
        ) groups -- All groupings.
        lateral view posexplode(regexp_extract_all(lpad(conv(groups.pos, 10, 2), 3, "0"), '(0|1)')) pe as placeholder_idx, placeholder_bit -- The bit information of each grouping.
    ) gb
    left join ( -- All dimension fields.
        select posexplode(split("province,city,device_type", ','))
    ) dim_col on gb.placeholder_idx = dim_col.pos
    group by gb.group_id
)
select
    group_dimension.dimension_name,
    province, city, device_type,
    visit_count
from (
    select
        grouping_id(province, city, device_type) as group_id,
        province, city, device_type,
        count(1) as visit_count
    from visit_log b
    group by province, city, device_type
    grouping sets(
        (province),
        (province, city),
        (province, city, device_type)
    )
) t
join group_dimension on t.group_id = group_dimension.group_id
order by group_dimension.dimension_name;
III. Intervals
Intervals have different data characteristics from sequences, yet in practice the handling of the two has much in common. This section introduces some common interval scenarios and abstracts general solutions.
1. Common interval operations
1) Interval segmentation
Given a numeric interval [a, b], how do we divide this interval evenly into n sub-intervals?
The problem can be reduced to an arithmetic sequence problem, with the sequence formula a_n = a + (n - 1) * d, where the common difference d = (b - a) / n is the length of each sub-interval.
The specific steps are as follows:
Generate an array of length n;
Generate an index subscript for each element in the array via the UDTF function posexplode;
Take out the index subscript of each element and substitute it into the sequence formula to obtain the start and end values of each sub-interval.
SQL implementation:
sql - 7
select
    a + t.pos * d as sub_interval_start, -- The start value of the sub-interval.
    a + (t.pos + 1) * d as sub_interval_end -- The end value of the sub-interval.
from (
    select posexplode(split(space(n - 1), space(1), false))
) t;
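For example, a hypothetical instantiation that splits [0, 10] into n = 5 sub-intervals, so that d = (10 - 0) / 5 = 2:
select
    0 + t.pos * 2 as sub_interval_start, -- 0, 2, 4, 6, 8
    0 + (t.pos + 1) * 2 as sub_interval_end -- 2, 4, 6, 8, 10
from (
    select posexplode(split(space(5 - 1), space(1), false))
) t;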
2) Interval overlap
Two date intervals, ['2024-01-01', '2024-01-03'] and ['2024-01-02', '2024-01-04'], are known to overlap. Questions:
How do we merge the two date intervals and return the new merged interval?
How do we know which dates overlap, and return the overlap count for each such date?
There are many ways to solve these problems; here is a simple and versatile one. The core idea is to combine sequence generation with interval segmentation: first decompose each date interval into its smallest processing units, i.e. a sequence of individual dates, and then compute statistics at date granularity. The specific steps are as follows:
Get the number of days contained in each date interval;
Split each date interval into an ascending date sequence of that many days;
Compute the merged interval and the overlap counts from the date sequences.
SQL implementation:
sql - 8
with dummy_table as (
    select stack(
        2,
        '2024-01-01', '2024-01-03',
        '2024-01-02', '2024-01-04'
    ) as (date_start, date_end)
)
select
    min(date_item) as date_start_merged,
    max(date_item) as date_end_merged,
    collect_set( -- Overlapping dates and counts.
        case when date_item_cnt > 1 then concat(date_item, ':', date_item_cnt) else null end
    ) as overlap_date
from (
    select
        -- Each single date after splitting.
        date_add(date_start, pos) as date_item,
        -- The number of occurrences of each single date after splitting.
        count(1) over (partition by date_add(date_start, pos)) as date_item_cnt
    from dummy_table
    lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t;
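For the two sample intervals above, the dates 2024-01-02 and 2024-01-03 each occur twice, so the query should return:
-- date_start_merged: 2024-01-01
-- date_end_merged:   2024-01-04
-- overlap_date:      ['2024-01-02:2', '2024-01-03:2']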
Make it a little more difficult!
If there are multiple date intervals, and the overlap relationships among them are unknown, how do we solve the above problems? Namely:
How do we merge multiple date intervals and return the multiple new intervals after merging?
How do we know which dates overlap, and return the overlap count for each such date?
SQL implementation:
sql - 9
with dummy_table as (
    select stack(
        ... -- Multiple date intervals.
    ) as (date_start, date_end)
)
select
    min(date_item) as date_start_merged,
    max(date_item) as date_end_merged,
    collect_set( -- Overlapping dates and counts.
        case when date_item_cnt > 1 then concat(date_item, ':', date_item_cnt) else null end
    ) as overlap_date
from (
    select
        -- Each single date after splitting.
        date_add(date_start, pos) as date_item,
        -- The number of occurrences of each single date after splitting.
        count(1) over (partition by date_add(date_start, pos)) as date_item_cnt,
        -- Regroup each single date into a new interval tag.
        date_add(date_add(date_start, pos), 1 - dense_rank() over (order by date_add(date_start, pos))) as cont
    from dummy_table
    lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t
group by cont;
2. Examples of application scenarios
1) Statistics over an arbitrary time period
Scenario description
There is a user repayment schedule table; each row indicates that a user repays a fixed amount (repayment, in yuan) per day within the specified date interval [date_start, date_end].
sql - 10
with user_repayment as (
    select stack(
        ... -- Fields: user, start date, end date, daily repayment amount.
    ) as (user_id, date_start, date_end, repayment)
)
select * from user_repayment;
How do we calculate the total daily repayment amount of all users within an arbitrary period (e.g. 2024-01-15 to 2024-01-16)?
Solution Ideas:
The core idea is to convert each date interval into a date sequence, and then aggregate statistics by date.
SQL implementation:
sql - 11
select
    date_item as day,
    sum(repayment) as total_repayment
from (
    select
        date_add(date_start, pos) as date_item,
        repayment
    from user_repayment
    lateral view posexplode(split(space(datediff(date_end, date_start)), space(1), false)) t as pos, val
) t
where date_item >= '2024-01-15' and date_item <= '2024-01-16'
group by date_item
order by date_item;
IV. Permutations and Combinations
Permutations and combinations are common ways of organizing discrete data. This section introduces how to implement them, and focuses, with examples, on processing data through combinations.
1. Common permutation and combination operations
1) Permutations
Given the character sequence ['a', 'b', 'c'], 2 characters are selected at a time, with repetition allowed. How do we get all the permutations?
With the help of multiple lateral views, the overall implementation is relatively simple.
sql - 12
select
    concat(val1, val2) as perm
from (select split('a,b,c', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2;
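Since each of the two positions can take any of the 3 characters, this yields 3^2 = 9 permutations:
-- perm: aa, ab, ac, ba, bb, bc, ca, cb, cc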
2) Combination
Given the character sequence ['a', 'b', 'c'], 2 characters are selected at a time, with repetition allowed. How do we get all the combinations?
With the help of multiple lateral views, the overall implementation is relatively simple.
sql - 13
select
    concat(least(val1, val2), greatest(val1, val2)) as comb
from (select split('a,b,c', ',') as characters) dummy
lateral view explode(characters) t1 as val1
lateral view explode(characters) t2 as val2
group by least(val1, val2), greatest(val1, val2);
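The least/greatest pair normalizes the order within each pair, so duplicates such as 'ba' collapse into 'ab', yielding 6 combinations:
-- comb: aa, ab, ac, bb, bc, cc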
Note: You can also use the MaxCompute (ODPS) system function combinations to quickly generate combinations.
sql - 14
select combinations(array('foo', 'bar', 'boo'),2);
result
[['foo', 'bar'], ['foo', 'boo'], ['bar', 'boo']]
2. Examples of application scenarios
1) Group comparison statistics
Scenario description
There is a delivery strategy conversion table; each row represents the number of orders generated by a delivery strategy in one day.
sql - 15
with strategy_order as (
    select stack(
        3,
        '2024-01-01', 'strategy a', 10,
        '2024-01-01', 'strategy b', 20,
        '2024-01-01', 'strategy c', 30
    ) -- Fields: date, delivery strategy, order volume.
    as (dt, strategy, order_cnt)
)
select * from strategy_order;
How do we set up pairwise comparison groups of delivery strategies, and compare and display the conversion order volumes of the different strategies by group?
Solution Ideas:
The core idea is to select 2 strategies at a time, without repetition, from the list of all delivery strategies, generate all the combinations, and then join the strategy order table to compute grouped statistics.
SQL implementation:
sql - 16
select /*+ mapjoin(combs) */
    combs.strategy_comb,
    so.strategy,
    so.order_cnt
from strategy_order so
join ( -- Generate all comparison groups.
    select
        concat(least(val1, val2), '-', greatest(val1, val2)) as strategy_comb,
        least(val1, val2) as strategy_1,
        greatest(val1, val2) as strategy_2
    from (
        select collect_set(strategy) as strategies
        from strategy_order
    ) dummy
    lateral view explode(strategies) t1 as val1
    lateral view explode(strategies) t2 as val2
    where val1 <> val2
    group by least(val1, val2), greatest(val1, val2)
) combs on 1 = 1
where so.strategy in (combs.strategy_1, combs.strategy_2)
order by combs.strategy_comb, so.strategy;
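With the sample data above, three comparison groups are generated, and each strategy's order volume appears in every group that contains it; the expected shape of the result is:
-- strategy_comb          strategy    order_cnt
-- strategy a-strategy b  strategy a  10
-- strategy a-strategy b  strategy b  20
-- strategy a-strategy c  strategy a  10
-- strategy a-strategy c  strategy c  30
-- strategy b-strategy c  strategy b  20
-- strategy b-strategy c  strategy c  30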
V. Continuity
This section focuses on continuity problems, especially the common continuous activity scenarios, and describes different implementations for static and dynamic continuous activity.
1. Ordinary continuous activity statistics
Scenario description
In the existing user access log table, each row of data represents a user access log.
sql - 17
with visit_log as (
    select stack (
        6,
        '2024-01-01', '101', 'Hubei', 'Wuhan', 'android',
        '2024-01-01', '102', 'Hunan', 'Changsha', 'ios',
        '2024-01-01', '103', 'Sichuan', 'Chengdu', 'windows',
        '2024-01-02', '101', 'Hubei', 'Xiaogan', 'mac',
        '2024-01-02', '102', 'Hunan', 'Shaoyang', 'android',
        '2024-01-03', '101', 'Hubei', 'Wuhan', 'ios'
    ) -- Fields: date, user, province, city, device type.
    as (dt, user_id, province, city, device_type)
)
select * from visit_log;
How do we find users who were active on 2 consecutive days?
In the above question, the continuity analysis compares its result against a fixed threshold (continuous activity greater than an n-day threshold). We categorize this here as ordinary continuous activity statistics.
SQL implementation:
Based on the difference between adjacent dates (lag/lead version).
The overall implementation is relatively simple.
sql - 18
select user_id
from (
    select
        *,
        lag(dt, 2 - 1) over (partition by user_id order by dt) as lag_dt
    from (select dt, user_id from visit_log group by dt, user_id) t0
) t1
where datediff(dt, lag_dt) + 1 = 2
group by user_id;
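With the sample log, user 101 is active on 2024-01-01 through 2024-01-03 and user 102 on 2024-01-01 and 2024-01-02, while user 103 has only a single active day, so the query should return:
-- user_id: 101, 102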
Implemented based on the difference between adjacent dates (sorted version).
The overall implementation is relatively simple.
sql - 19
select user_id
from (
    select
        *,
        dense_rank() over (partition by user_id order by dt) as dr
    from visit_log
) t1
where datediff(dt, date_add(dt, 1 - dr)) + 1 = 2
group by user_id;
Implemented based on consecutive active days
This can be seen as a derivative of "Implemented based on the difference between adjacent dates (sorted version)"; it can obtain more information, such as the number of consecutive active days.
sql - 20
select user_id
from (
    select
        *,
        -- The number of consecutive active days.
        count(distinct dt) over (partition by user_id, cont) as cont_days
    from (
        select
            *,
            date_add(dt, 1 - dense_rank() over (partition by user_id order by dt)) as cont
        from visit_log
    ) t1
) t2
where cont_days >= 2
group by user_id;
Based on continuous active intervals
This can be seen as a derivative of "Implemented based on the difference between adjacent dates (sorted version)"; it can obtain more information, such as the continuous active interval.
sql - 21
select user_id
from (
    select
        user_id, cont,
        -- The continuous active interval.
        min(dt) as cont_date_start, max(dt) as cont_date_end
    from (
        select
            *,
            date_add(dt, 1 - dense_rank() over (partition by user_id order by dt)) as cont
        from visit_log
    ) t1
    group by user_id, cont
) t2
where datediff(cont_date_end, cont_date_start) + 1 >= 2
group by user_id;
2. Dynamic continuous active statistics
Scenario description
In the existing user access log table, each row of data represents a user access log.
sql - 22
with visit_log as (
    select stack (
        6,
        '2024-01-01', '101', 'Hubei', 'Wuhan', 'android',
        '2024-01-01', '102', 'Hunan', 'Changsha', 'ios',
        '2024-01-01', '103', 'Sichuan', 'Chengdu', 'windows',
        '2024-01-02', '101', 'Hubei', 'Xiaogan', 'mac',
        '2024-01-02', '102', 'Hunan', 'Shaoyang', 'android',
        '2024-01-03', '101', 'Hubei', 'Wuhan', 'ios'
    ) -- Fields: date, user, province, city, device type.
    as (dt, user_id, province, city, device_type)
)
select * from visit_log;
How do we obtain the 2 users with the longest continuous activity, and output each user, the longest number of consecutive active days, and the corresponding date range?
When analyzing continuity here, the result is not, and cannot be, compared against a fixed threshold; instead, each user's longest continuous activity serves as a dynamic threshold. We categorize this as dynamic continuous activity statistics.
SQL implementation:
Based on the ordinary continuous activity statistics above, the final SQL:
sql - 23
select
    user_id,
    -- The longest number of consecutive active days.
    datediff(max(dt), min(dt)) + 1 as cont_days,
    -- The longest continuous active date range.
    min(dt) as cont_date_start, max(dt) as cont_date_end
from (
    select
        *,
        date_add(dt, 1 - dense_rank() over (partition by user_id order by dt)) as cont
    from visit_log
) t1
group by user_id, cont
order by cont_days desc
limit 2;
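With the sample log, user 101 has a 3-day streak (2024-01-01 to 2024-01-03) and user 102 a 2-day streak (2024-01-01 to 2024-01-02), so the query should return:
-- user_id  cont_days  cont_date_start  cont_date_end
-- 101      3          2024-01-01       2024-01-03
-- 102      2          2024-01-01       2024-01-02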
VI. Extension
Extending to more complex scenarios means combining and varying the content of the previous chapters.
1. Interval continuity (longest sub-interval division)
Scenario description
There is a user WiFi log table of users scanning or connecting to WiFi; each row represents one log entry of a user scanning or connecting to a WiFi network at a certain time.
sql - 24
with user_wifi_log as (
    select stack (
        9,
        '2024-01-01 10:01:00', '101', 'cmcc-starbucks', 'scan', -- Scan.
        '2024-01-01 10:02:00', '101', 'cmcc-starbucks', 'scan',
        '2024-01-01 10:03:00', '101', 'cmcc-starbucks', 'scan',
        '2024-01-01 10:04:00', '101', 'cmcc-starbucks', 'conn', -- Connect.
        '2024-01-01 10:05:00', '101', 'cmcc-starbucks', 'conn',
        '2024-01-01 10:06:00', '101', 'cmcc-starbucks', 'conn',
        '2024-01-01 11:01:00', '101', 'cmcc-starbucks', 'conn',
        '2024-01-01 11:02:00', '101', 'cmcc-starbucks', 'conn',
        '2024-01-01 11:03:00', '101', 'cmcc-starbucks', 'conn'
    ) -- Fields: time, user, WiFi, status (scan, connected).
    as (time, user_id, wifi, status)
)
select * from user_wifi_log;
User behavior analysis is now required: how do we divide a user's different WiFi behavior intervals, satisfying the following?
There are two behavior types: scan and conn;
A behavior interval is defined as: the same behavior type, with the time difference between two adjacent behaviors not exceeding 30 minutes;
Each behavior interval should be as long as possible while satisfying the definition.
This problem is a bit more complex and can be seen as a variant of the "longest continuous activity" problem from the dynamic continuous activity statistics. It can be described as: dividing the longest sub-intervals by combining a continuity threshold with contextual information in the behavior sequence.
SQL implementation:
Core logic: divide behavior intervals grouped by user and WiFi, combining the continuity threshold with contextual information from the behavior sequence.
Detailed steps:
Group by user and WiFi, and sort the records chronologically within the group window;
Whenever the time difference between two adjacent records exceeds 30 minutes, or the behavior status of the two records changes (scan vs conn), start a new behavior interval at that critical point, until all records have been traversed;
Final output: user, WiFi, behavior status (scan or conn), behavior start time, behavior end time.
sql - 25
select
    user_id,
    wifi,
    max(status) as status,
    min(time) as start_time,
    max(time) as end_time
from (
    select
        *,
        max(if(lag_status is null or lag_time is null or status <> lag_status or datediff(time, lag_time, 'ss') > 60 * 30, rn, null))
            over (partition by user_id, wifi order by time) as group_idx
    from (
        select
            *,
            row_number() over (partition by user_id, wifi order by time) as rn,
            lag(time, 1) over (partition by user_id, wifi order by time) as lag_time,
            lag(status, 1) over (partition by user_id, wifi order by time) as lag_status
        from user_wifi_log
    ) t1
) t2
group by user_id, wifi, group_idx;
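Given the sample log, the 55-minute gap between 10:06:00 and 11:01:00 splits the conn records into two intervals, so three behavior intervals should be produced:
-- user_id  wifi            status  start_time           end_time
-- 101      cmcc-starbucks  scan    2024-01-01 10:01:00  2024-01-01 10:03:00
-- 101      cmcc-starbucks  conn    2024-01-01 10:04:00  2024-01-01 10:06:00
-- 101      cmcc-starbucks  conn    2024-01-01 11:01:00  2024-01-01 11:03:00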
The continuity criterion in this case can be extended to more scenarios, such as data scenarios based on date differences, time differences, enumerated types, and distance differences.
Conclusion
Solving complex data scenarios with basic syntax through flexible and divergent data processing thinking: this is the idea that runs through the entire article. For common scenarios such as sequence generation, interval transformation, permutations and combinations, and continuity analysis, relatively general solutions have been given and their practical application explained with examples.
This article tries to take a distinctive approach, emphasizing flexible data processing thinking, in the hope of giving readers fresh inspiration and real help. That said, one person's ability is limited, the ideas are not necessarily optimal, and errors may even have crept in; comments and suggestions are welcome.
Author丨Rigaud
Source丨Alibaba Developer (ID: Ali Tech)