
There is demand for functionality similar to the industry-standard SHA-256 hash function: condensing the content of an RDBMS table into a single value that changes completely if even a single bit changes, even in tables with millions or billions of rows.
The main driver for this demand is that dual systems – EDW dual active, or mixtures of EDW and Appliances – are increasingly in use. With this, cross-system reconciliation becomes more important. Requirements range from monitoring the correctness of IT processes on both systems to regulatory requirements such as SOX compliance.
In 2013 I shared the idea of a table hash calculation that is order-independent and also works in relational database systems at http://developer.teradata.com/blog/ulrich/2013/05/calculation-of-table-hash-values-to-compare-table-content . That post referred to my Partners presentation of 2012, which is attached to this post.

As the topic sounds very theoretical, I developed a Java UDF implementation so that everybody who is interested can test this approach.

The Java UDF implements the algorithm described in the Partners presentation of 2012. Because Java UDFs always run in protected mode, this UDF will not give you good performance, but it should show you how the UDF is called and how it can be used to reconcile tables / queries between two different systems.
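The core idea – hashing each row individually and combining the row hashes with a commutative operation, so that the result is independent of row order – can be sketched outside the database as well. The sketch below is my own illustration of that general principle, not the exact algorithm of the UDF; in particular, the combination by addition modulo 2^160 is an assumption chosen for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class TableHashSketch {

    // 2^160 keeps the accumulated value in the range of a SHA-1 digest.
    private static final BigInteger MOD = BigInteger.ONE.shiftLeft(160);

    // Hash every row string with SHA-1 and combine the digests by
    // addition modulo 2^160. Addition is commutative and associative,
    // so the final value does not depend on the order of the rows.
    public static String tableHash(Iterable<String> rows) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            BigInteger acc = BigInteger.ZERO;
            for (String row : rows) {
                byte[] digest = md.digest(row.getBytes(StandardCharsets.UTF_8));
                acc = acc.add(new BigInteger(1, digest)).mod(MOD);
            }
            return String.format("%040x", acc); // 40 hex chars, like SHA-1
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        java.util.List<String> a = java.util.Arrays.asList("row1", "row2", "row3");
        java.util.List<String> b = java.util.Arrays.asList("row3", "row1", "row2");
        // Same rows in a different order give the same table hash.
        System.out.println(tableHash(a).equals(tableHash(b))); // prints "true"
    }
}
```

Any commutative, collision-resistant combination step works here; the important property is that two tables with identical row sets produce identical hashes regardless of physical row order.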

If you are interested in the fast version (a C UDF with faster hash functions), contact me directly, but don't expect it for free. It is the result of long-running R&D activities, which have their price.

1. tablehash_sha1.zip

This contains one jar file, three SQL files, and one logon file in case you install the UDF via BTEQ.

Now you are able to calculate table hashes – see the SQL examples below.

2. tdth.zip

As outlined in the presentation, we have to build a long concatenated string for this function – a Teradata-internal implementation could avoid this heavy burden, as also stated in the presentation.
This can make usage somewhat inconvenient. To address this, we developed a small Java program.
To set up the program (Java 1.6 or higher is needed on the computer where it runs), a bit of one-time configuration is required:

This will calculate the table_hash for the sys_calendar.calendar view.

A nice side effect is that you can use this call to generate the correct SQL: if you remove the <process/> line from the base.xml parameter file, it will only generate the SQL and not execute it.
After the run you will find tdth_sql.txt in the log folder (the name can be changed in the base.xml parameter file). This file contains the SQL that was generated based on DBC information (NULL handling, data type handling, etc.). This is quite useful for generating the SQL statements and executing them directly in SQLA or Teradata Studio.
There are more options available (SQL instead of views / tables, etc.), which are described in more detail in the documentation (doc folder).
You will notice that the first run is slower than subsequent ones – I assume the database needs to initialize each Java UDF the first time it runs.
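The per-column pattern of the generated SQL is visible in the examples below: each column is cast to VARCHAR, trimmed, and NULLs are replaced by a fixed token so that NULL values still contribute a well-defined string. As a hypothetical sketch of how such a column expression could be assembled (the method name and the fixed VARCHAR length are my assumptions, not taken from the tool):

```java
public class ColumnExprSketch {

    // Fixed replacement token for NULL values, as used in the generated SQL.
    static final String NULL_TOKEN = "109085beaaa80ac89858b283a64f7c75d7e5bb12";

    // Build the per-column expression: cast to VARCHAR, trim, replace NULL.
    static String columnExpr(String column, int varcharLen) {
        return "coalesce ( TRIM(CAST(" + column + " AS VARCHAR(" + varcharLen
                + "))), '" + NULL_TOKEN + "' )";
    }

    public static void main(String[] args) {
        System.out.println(columnExpr("calendar_date", 75));
    }
}
```

Replacing NULL with a fixed token matters: without it, the concatenation would make a NULL column swallow the whole row string, and rows differing only in NULL placement could hash identically.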

But as mentioned above, the Java UDF will always be slow, as it runs in protected mode. So don't run it against multi-million-row tables!
The C version in unprotected mode runs about 100 times faster.

Finally, some SQL examples (the base UDF call was generated with the call above) and the expected results:

SELECT
 yourDB.tablehash_sha1( 
 coalesce ( TRIM(CAST(calendar_date AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_week AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(weekday_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(week_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(week_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(week_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(month_of_quarter AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(month_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(month_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(quarter_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(quarter_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(year_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' )
 ) AS table_hash 
FROM
sys_calendar.calendar
;

expected result: a490387fbc9b93b74ea85d0312c23da7c6e8ae0e

As the UDF is a standard aggregate function, you can use the same call to get hash values per group (which can help narrow down where differences exist):

SELECT month_of_year,
 yourDB.tablehash_sha1( 
 coalesce ( TRIM(CAST(calendar_date AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_week AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(day_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(weekday_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(week_of_month AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(week_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(week_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(month_of_quarter AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(month_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(month_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(quarter_of_year AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(quarter_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' ) ||
 coalesce ( TRIM(CAST(year_of_calendar AS VARCHAR(75))), '109085beaaa80ac89858b283a64f7c75d7e5bb12' )
 ) AS table_hash 
FROM
 sys_calendar.calendar 
where year_of_calendar = 2014
group by 1
order by 1;
expected results:
month, table_hash
1, f62e7c2d6647907c9f90d9de260a5e3f48a2f121
2, f6c224f2be371f281c759af71a3664de76dbefdf
3, f6584c17e5e0b86db8be2eed3f4b5f8509b02b48
4, 6d17c7bae4f9d2525ee88706d0150a30baddceee
5, 656eda9686371c53de25718b093fb84c199f421d
6, eda1fa483e7f7cd14353ed84fff210c3c0074201
7, adabac9c58e95b6c7f3b2be58724e48f6bfd8745
8, 87cfeab49fee83278f3099d1278fc69e415d2f16
9, f33d69a0604219d9c6292b2508589b2f13180951
10, 3ce090dcd97a585b9c3191a47cf617a8e3ab95b0
11, 5575cf90bb46bed6a21e477cc249582e372a69d0
12, 9ee1449915a7a67b8d8d0d9f7eb44f466943e45b
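The per-group behavior can also be reproduced outside the database: accumulate one order-independent hash per group key instead of a single total. The sketch below illustrates this with the same assumed combination step (per-row SHA-1, addition modulo 2^160); it is my illustration of the GROUP BY semantics, not the UDF's internal code.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupHashSketch {

    private static final BigInteger MOD = BigInteger.ONE.shiftLeft(160);

    // One order-independent hash per group key, mirroring the GROUP BY call:
    // the input maps a group key to the row strings of that group.
    static Map<String, String> groupHashes(Map<String, List<String>> groups) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            Map<String, String> result = new TreeMap<>();
            for (Map.Entry<String, List<String>> e : groups.entrySet()) {
                BigInteger acc = BigInteger.ZERO;
                for (String row : e.getValue()) {
                    byte[] digest = md.digest(row.getBytes(StandardCharsets.UTF_8));
                    acc = acc.add(new BigInteger(1, digest)).mod(MOD);
                }
                result.put(e.getKey(), String.format("%040x", acc));
            }
            return result;
        } catch (java.security.NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new TreeMap<>();
        groups.put("1", java.util.Arrays.asList("rowA", "rowB"));
        groups.put("2", java.util.Arrays.asList("rowC"));
        System.out.println(groupHashes(groups));
    }
}
```

Comparing the per-group hashes of two systems quickly narrows a mismatch down to the groups (here: months) whose rows actually differ.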

Given this – enjoy trying it, and let me know if you have any questions.
