Teradata V14.10 Join Feature PRPD
- Rows being joined must be on the same AMP, so Teradata will redistribute one or both tables by the join condition if the rows are not naturally on the same AMP.
- If a table is being redistributed by the join column and that column causes skew, the query might abort with a spool error because all spool spaces is calculated on a per AMP basis.
Partial Redistribution Partial Duplication (PRPD)
Partial Redistribution Partial Duplication (PRPD) is a new hybrid join strategy for joins with equality join conditions in the case where one or both of the joining tables are skewed.
- It provides a “divide-and-conquer” approach to reduce the impact of data skew on query and system “HOT AMP” situations.
- PRPD divides the two sources into subsets based on skewed values extracted from statistics histograms, performs several (2 or 3) regular joins between those subsets, and combines the join results of the subsets into a common spool.
- 2-fold plan: If only one source is skewed, both sources are divided/split into two subsets and 2 regular joins are performed.
- 3-fold plan: If both sources are skewed on different values, both of them are divided/split into three subsets, and 3 regular joins are performed.
Joining Skewed Tables on an Equality Join Condition
- When one or both tables in a join contain skewed values, the performance of the join operation is frequently degraded. Skew on certain values refers to the variation in the number of rows that contain those values among the various AMPs. If the variation is high, meaning that some AMPs have a high number of rows with those values and some AMPs have far fewer rows with the values, the values are described as being skewed and the base table is said to be skewed on these values.
- For example, consider the 2 tables material and price, which are joined on the condition material.material_id = price.material_id, and column price.material_id is skewed on the value 1.
The classic join strategies pursue various geographic methods such as those listed below, where the relative sizes of the individual tables play a major role in selecting the optimal strategy. For example, if the table being duplicated in the second or third plan is large or if both the tables are large, any plan chosen from the list can degrade performance.
- Redistribute material on material.material_id and redistribute price on price.material_id.
- Duplicate material and access price locally/directly.
- Access material locally/directly and duplicate price.
A partial redistribution partial duplication (PRPD) join strategy helps to minimize the impact of skew on join performance by, for example, making the join between price and material as 2 separate joins.
- For the first join, PRPD keeps the rows with the skewed value 1 of price.material_id locally on each AMP and duplicates rows from material.material_id that match the skewed value of price.material_id to all AMPs and then joins only those rows.
- For the second join, PRPD joins the non-skewed rows from the price table and the rest of the rows from the material table using whatever join method the Optimizer determines to be best.
- Teradata Database combines the results of these 2 joins as the final join result.
- The process of dividing the rows in a single source into several subparts is referred to as a split. Using the PRPD join strategy, Teradata Database splits both the price and material tables into 2 relations that participate in 2 regular joins later in the process. For the subparts of the 2 tables with skewed values, the Optimizer also selects the best join plan based on the costs of the different join plans, so the geographies are set according to the most optimal join plan.
- The preceding case is just one possible example of PRPD when there is a single skewed value in a single column of one table. The Optimizer can also use PRPD when there is more than one skewed values and multiple join columns with skew in one or both of the tables. When the set of join conditions is on expressions of base table columns and statistics have been collected on the expressions, the Optimizer can also use PRPD to join skewed tables.
- For the preceding example, the Optimizer selects a local geography for the skewed rows in price and a duplication geography for the rows with skewed values in material. These geographies are not fixed in PRPD.
- PRPD requires the same demographic support as any other join operation, such as accurate statistics, which it uses to determine the list of skewed values. PRPD uses skew detection logic to update existing statistics if the Optimizer determines that is necessary. The Optimizer selects a join plan using PRPD only if it is cheaper than other join methods. When both relations being joined are skewed on the same value, the selection of which relation is the skewed relation depends on the number of rows with skewed values and the row size of both relations. The Optimizer selects the relation with a higher cost based on those factors as the skewed relation.
- The main challenge the Optimizer faces when it determines whether to use PRPD for a join is to detect the biased values and their frequencies because the single table conditions and previous joins, if any, might filter out some loners and alter the frequency of the surviving loners. The Optimizer does not always have accurate loner information for PRPD planning.
- PRPD is designed to operate in one of 2 modes, depending on an internal parameter setting.
|MODE||The Optimizer considers a PRPD plan|
Example 1) : PRPD for Skewed Expression Statistics Following are some examples to clarify when the Optimizer can try a PRPD plan with join operations.
Consider the following 3 tables for the example set.
- t1(x1, y1, z1), primary index defined on (x1)
- t2(x2, y2, z2), primary index defined on (x2)
- t3(x3, y3, z3), primary index defined on (x3)
In the following examples, when it is said that a relation qualifies for PRPD, it means that the following items are true.
- Statistics are collected on the hashed join column set.
- The demographics of the join column set satisfy a set of conditions that are determined by an internal parameter setting.
- Normal PRPD mode is assumed by default. If an example applies to aggressive mode only, that is explicitly stated.
The Optimizer evaluates t1 for PRPD only when statistics on the expression (t1.y1 + t1.z1) are available. If the Optimizer determines that t1 qualifies for PRPD based on (t1.y1 + t1.z1) expression statistics,
WHERE t1.y1 + t1.z1 = t2.y2;
Example 2) : PRPD for Multiple Skewed Tables
In this example, if either t1 or t2 qualifies for PRPD, the Optimizer tries a PRPD plan by considering t1 or t2 as the skewed relation. In this case, there are two partial joins. If both tables t1 and t2 are skewed, the Optimizer tries a PRPD plan using 3 joins.
FROM t1, t2
WHERE t1.y1 = t2.y2;
Example 3) : PRPD for Multicolumn Sets Skewed on Some Values
Suppose (t1.y1, t1.z1) and (t2.y2, t2.z2) are skewed on some values and you have collected multicolumn statistics on those column sets.
To process this request, the Optimizer normally picks an inclusion join with a pre-join sort that removes duplicates on (t1.y1, t1.z1). If there is skew on (t1.y1, t1.z1), the pre-join sort removes the duplicates and removes the skew as well. Because of this, the Optimizer does not evaluate a PRPD plan by considering t1 to be a skewed table. However, if there is skew on (t2.y2, t2.z2), the Optimizer does consider a PRPD plan.
WHERE t2.y2 IN (SELECT t1.y1
WHERE t2.z2 = t1.z1
Example 4) : PRPD for Skewed Single Column Statistics
Suppose t1.y1 is skewed, statistics have been collected on t1.y1, and the first join (call it R4) is between tables t1 and t3. After the first join, the Optimizer cannot determine which loners from y1 survived the join condition t1.x1 = t3.x3, so for the R4 X R2 join, R4 does not qualify for PRPD. For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.
FROM t1,t2, t3
WHERE t1.y1 = t2.y2
AND t1.x1 = t3.x3;
Example 5) : PRPD for Mixed Single‑Column and Multicolumn Statistics on Skewed Values
In this example, the derived statistics logic can find the range of surviving values for t1.y1 after applying the single‑table condition, so t1 qualifies for PRPD if statistics have been collected on (t1.y1) and (t1.z1, t1.y1).
If multicolumn statistics on (t1.z1, t1.y1) have not been collected, the Optimizer cannot find the surviving loners on column t1.y1 after it applies the single-table condition so t1 does not qualify for PRPD in normal mode. For this case, the Optimizer considers PRPD only if the internal parameters for PRPD are set for aggressive mode.
WHERE t1.y1 = t2.y2
AND t1.z1 > 5;
There are other scenarios as well where the Optimizer can find the surviving values on the join column with some confidence when there are single‑table conditions. An example is when there is a sparse join index that covers the table and statistics have been collected on the join column in the join index. For this example, the join can qualify for PRPD if there is a join index with the following definition and statistics are collected on j1.y1, t1 can qualify for PRPD.
CREATE JOIN INDEX j1 AS
FROM t1 WHERE z1 > 5;
PRPD : Research doc : http://idb.snu.ac.kr/images/e/e9/Handling_Data_Skew.pdf