Guru: Use SQL To Find Duplicate Source Code
March 12, 2018 Ted Holt
According to Brian Tracy, “good habits are hard to develop but easy to live with; bad habits are easy to develop but hard to live with. The habits you have and the habits that have you will determine almost everything you achieve or fail to achieve.” This is as true in programming as in anything else we may do.
Unfortunately, even those of us who strive for good work habits often have to follow the work of people who did not. One bad habit I come across occasionally is known in software engineering as WET solutions. WET stands for “write everything twice” or “we enjoy typing” or “waste everyone’s time.” The antidote is the DRY principle: “don’t repeat yourself.”
Not long ago I had to modify a 13,000-line RPG program, the sort of thing that is beyond the capacity of my little brain to comprehend. I could tell there was repetition in the code. How did I find it? I used SQL.
It may seem strange to use SQL for source code, but source code is data. It’s output from a programmer and input to a compiler. Since it’s stored in source physical files, using SQL to query it — and even to modify it — is a cinch.
A source physical file has three fields, which the Display File Field Description (DSPFFD) command will show you. They are: SRCSEQ (sequence number), SRCDAT (change date), and SRCDTA (source data). You will probably ignore the source date.
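If you would rather stay in SQL than run DSPFFD, a catalog query along these lines should list the same three fields. This is only a sketch: it assumes the QSYS2.SYSCOLUMNS catalog view is available on your system, and SOMELIB and SOMEFILE are placeholder names.

```sql
-- List the fields of a source physical file from the SQL catalog.
-- SOMELIB and SOMEFILE are placeholders for your own library and file.
select column_name, data_type, length
  from qsys2.syscolumns
 where table_schema = 'SOMELIB'
   and table_name   = 'SOMEFILE'
 order by ordinal_position;
```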
To query a source member, create an alias. If you query the source physical file itself, you will access the first member, which is not the first member alphabetically, but the one that was added first. It will likely not be the member you want.
create or replace alias qtemp.tempalias for somelib.somefile(somembr)
In this example, I cleverly named the alias TEMPALIAS and put it in the QTEMP library. When you reference TEMPALIAS in an SQL statement, the database manager will access member SOMEMBR in source physical file SOMEFILE in library SOMELIB.
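Before going further, it may be worth confirming that the alias points at the member you expect. A quick look at the first few records is enough; the limit of 20 rows is an arbitrary choice.

```sql
-- Peek at the start of the member through the alias.
select srcseq, srcdat, srcdta
  from qtemp.tempalias
 order by srcseq
 fetch first 20 rows only;
```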
Now let’s look for duplicate code.
with source as (
  select s.srcseq, s.srcdta
    from qtemp.tempalias as s
   where substr(s.srcdta,7,1) <> '*'
     and substr(s.srcdta,8) <> ' '
     and substr(s.srcdta,6,1) = ' ')
select a.srcseq, b.srcseq, a.srcdta, b.srcdta
  from source as a
  join source as b
    on trim(a.srcdta) = trim(b.srcdta)
   and a.srcseq < b.srcseq
I began with a common table expression, SOURCE, to select the records that I want to include in the query. The important part of this expression is the WHERE clause, because that’s where you specify which lines of source code to include. I remove comment lines (an asterisk in column 7) and lines that are blank from column 8 onward. In this example, I also included a condition to select only rows with a blank in column 6, which in the 13,000-line program meant free-form calculations only. The WHERE clause varies widely depending on the type of source code you are analyzing (fixed-form RPG, free-form RPG, DDS, CL, etc.) and the preferences of the person or persons who wrote the code.
In the main query, I joined the source member to itself, looking for lines that matched but with different sequence numbers. By selecting records where the sequence number in the primary file was less than the sequence number in the secondary file, I reduced the size of the result set and yet found the duplicate code I was looking for.
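The self-join lists every matching pair, which can get long when the same line appears many times. If you only want a summary, a GROUP BY over the same common table expression is one alternative. This is a variation on the query above rather than the query I ran, and the column names LINE_TEXT, OCCURRENCES, FIRST_SEQ, and LAST_SEQ are made up for the illustration.

```sql
-- Summarize duplicated lines instead of listing every matching pair.
with source as (
  select s.srcseq, trim(s.srcdta) as line_text
    from qtemp.tempalias as s
   where substr(s.srcdta, 7, 1) <> '*'   -- skip comment lines
     and substr(s.srcdta, 8) <> ' '      -- skip lines blank from column 8 on
     and substr(s.srcdta, 6, 1) = ' '    -- blank in column 6 only
)
select line_text,
       count(*)    as occurrences,
       min(srcseq) as first_seq,
       max(srcseq) as last_seq
  from source
 group by line_text
having count(*) > 1
 order by occurrences desc, first_seq;
```

The lines at the top of the result are the ones repeated most often, which is usually a good place to start looking for copied blocks.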
In this example, I used the TRIM function in the join in case the person who copied the code from one spot in the program to another shifted the code. In other situations, you may not want to trim the blanks. Your query doesn’t have to be perfect — you’re not running your business on it. It only has to find duplicate code.
Of course, you can also use SQL to look for code that was duplicated between two source members. You need only create two aliases.
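A comparison of two members might look something like the following sketch. The alias names ALIASA and ALIASB and the member names MBRA and MBRB are placeholders, and the WHERE clauses only drop blank lines, so you will probably want to adjust them for the type of source you are comparing.

```sql
-- One alias per member. All names here are placeholders.
create or replace alias qtemp.aliasa for somelib.somefile(mbra);
create or replace alias qtemp.aliasb for somelib.somefile(mbrb);

-- Find lines that appear in both members.
with srca as (select srcseq, srcdta from qtemp.aliasa where srcdta <> ' '),
     srcb as (select srcseq, srcdta from qtemp.aliasb where srcdta <> ' ')
select a.srcseq as seq_a, b.srcseq as seq_b, a.srcdta as line_a
  from srca as a
  join srcb as b
    on trim(a.srcdta) = trim(b.srcdta)
 order by a.srcseq, b.srcseq;
```

There is no comparison of sequence numbers in this join, because the two sides come from different members and a line in one can legitimately match any line in the other.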
You use SQL to help other people do their work. Why not use SQL to help you do yours?