Index Keys in a Business Database
By David M. Woods
Published September 15, 2006, 11:13 am in News.
Many in the Information Technology business have studied database design at impressive universities, and most can probably quote the definitions of 1st and 2nd and 3rd level normalization. They are certified experts with impressive credentials hanging in frames on the wall.
Then there are the "seat-of-the-pants" programmers who had to figure out how to use databases all on their own.
But regardless of which group you fall into, all need to go back to the basics from time to time and discuss the fundamentals. Even the most well-educated I.T. pro may find that, once they leave the protected confines of academia and start actually developing software out in the "real world", there are issues that the textbooks didn't cover. And so, this paper will discuss an issue that all business database developers will contend with sooner or later: index keys.
The index key, for the uninitiated, is what we use to uniquely identify a specific row or "record" in a database table. There are many ways to select and/or construct a good and proper index key.
Topics to be discussed also go beyond mere database design. User-interface issues and concepts will be covered. Some of the concepts and recommendations herein are widely-accepted principles, taught at all the good schools and practiced by all seasoned I.T. pros. Others are just my humble opinion, garnered from over 25 years developing and supporting business database systems. But all of these recommendations should be useful to the "real world" programmer.
Rule #1: KISS
So many programmers forget the most basic of all rules: keep it simple, stupid. Actually, I have an even better rule: KIS&S: keep it simple and short. All other things else being equal, a short index key is better than a long index key any day. Look for ways to reduce index length. If, for example, you are using a numeric index and you estimate that the table will contain about a thousand records, then a 4-digit index is more than adequate; don't use a 10-ten digit index.
Secondly, don't create additional fields for indexing purposes if you don't have to. A database tables should never contain fields that can be done without. (More on this below under the topic "From whence an index?")
Unique or not unique, that is the question.
There are some rare cases where a unique index is not even needed. An example would be a table that stores only temporary data, especially if it contains data that will soon be crunched up into a more "permanent" table. Remember that even non-unique indexes can be used to optimize row sorting and ordering.
From whence an index?
For a database table where records need to be individually accessed, the rules for an index key are: 1) it must be unique, 2) it must be always available, 3) ideally, it should be short, and of a consistent format.
The ideal index key is a "natural" key. A natural key is one that already exists, as opposed to an "artificial" key, which is a special field created for no other purpose than to be the key. For example, if the table will contain invoices entered from a pre-numbered paper pad, then use the number that is already on the paper. If the table is to contain information about zip codes, then use the zip code! Natural keys are always more user-friendly than artificial keys.
The worst possible table design is to create an artificial key when a natural key already exists! It clutters the table with unnecessary data, a blatant violation of the KISS principle if there ever was one.
Some Microsoft database products do this automatically. In fact, the software documentation "recommends" always adding a special field for indexing, regardless of whether you need it or not.
On the other hand, sometimes a natural key that satisfies the rules does not exist. In this situation, you may have no other choice but to create an artificial key. Again, it must be emphasized that this should only be done after a rigorous search for a natural key comes up empty. (See "Parents & children" below for additional details.)
Parents & children
"Child" tables are a special situation. An example is an invoice "parent" that has multiple "lines" (children), where each line might contain an inventory item and quantity. From whence an index key for the children?
Some designers create another global index key for them. But this design is not recommended, because you end up with something like: "Invoice# 12345, line# 98765" which is not very user-friendly.
A better design would be to use a generated "line#" for the children. A line# starts over at 1 for each parent. The last line# used can be added as a field on the parent record. You end up with something like: "Invoice# 12345, line #2." This is friendlier, from the user's point of view, because "line #2" actually has meaning: it means the 2nd detail line on this invoice.
Yes, this does create a compound index: an index key constructed from multiple components. Compound indexes are not a problem at all for modern database managers, provided we keep them nice and short. But by the same token, don't create a compound index if it isn't needed; that violates the KISS principle. For example, the invoice# in the above example does not need a compound index; a simple numeric index will work just fine.
Numeric indexes
Either numbers, letters, or a combination thereof can be used for index fields. Alphabetic indexes have the advantage of being easier for the user to remember. If the table will contain a relatively small number of rows, and significant growth is not expected, consider using a brief alphabetic code for the index.
But for large tables, or tables that grow continuously, numeric indexes are the way to go.
There are several ways to incremented a numerical index. The easiest is to use the "auto-increment" feature (also known as "identity") found in most database managers. This may be applicable for a short-term, quick-and-dirty application, but for a more serious, professional enterprise application, it is a poor choice, for several reasons. One pitfall is that data rows from tables with this type if index key are difficult to move, copy, import, and export. Another problem is that the new key is not "revealed" until after the new record is added; from a user-friendliness perspective, it is more advantageous to display the new index value as soon as the user requests one.
Another technique to create a numeric index key is to increment the key of the last record, then append the new record. But this method is rather resource-intensive. In a large, heavily-used system, a "race" condition (two or more users all trying to do the same thing) might not produce a unique key. Also, if the user aborts the process, you are left with an empty "garbage" record.
The preferred incrementing technique is to use an external numbering source, such as a separate table for that purpose only. The system gets the last number, increments it, and saves the new number. There is just one numeric field to increment, and the system can "take its time" appending the new row to the target table.
Visible versus invisible index keys
Some developers are of the opinion that index key fields should be hidden from the end user. The underlying logic is that index keys only have relevance to programmers, and thus end users don't need to see them. The flagship application of this concept is QuickBooks, which has no visible codes for anything, not even General Ledger accounts.
This concept is faulty on several fronts. First of all, there is nothing wrong with users seeing and knowing index fields; in fact, it actually helps him/her better understand what the system is doing and how it works. It also better allows the user to communicate with the programmer.
It's like a Navy ship. On a Navy ship, all of the pipes, conduits, wires, cables, and control lines are out in the open, in plain sight. Is this a military version of decoration? No - those things are there so to facilitate maintenance! The shipbuilders could have hidden them all behind some aesthetically-pleasing cover, but chose not to, and for good reason. Besides, each time a sailor sees all those exposed pipes and things, he's learning some technical details on how the ship works. And it's better when the sailor can say "There's a leak coming from this green pipe labeled R-27-F-193!" than "There's a leak coming from behind this mahogany panel!" Software should be designed likewise: let the user see the details; it won't hurt him, and he'll probably be a better user for it.
Also, not all database operations will be done from the same front end that users use. Auditors, technical support personnel, and programmers who need to diagnose a problem or tweak the system will likely use a back-end database manager where they can see pure data in its raw glory, and it makes their jobs much easier if they have important facts like key field values available.
Then you have your power users. These are your "keyboard jockeys" with awesome typing speed who don't even use the mouse. They prefer to memorize things like codes and index values, and enter them manually.
A final reason for exposing all key values is that it allows the system to keep working even if the referential integrity gets broken for any reason. If the programmer uses "outer" joins wherever possible, then if, for example, the customer record for customer #1234 gets deleted despite the fact that customer #1234 has a thousand invoices, then all the reports and queries will still work. Diagnosing and fixing the broken reference will be a snap - but only if the "1234" customer codes displays on all the reports and queries.
Must index values be consecutive?
Here's a controversial topic. Some accounting analysts are of the opinion that all transactional numeric index values must be consecutive, and that a missing value indicates fraud, or a system failure.
Consecutive numbering has a place in the accounting world for documents like bank cheques, because the blank paper itself could be used to defraud the company. If a cheque is missing, it was likely stolen, and the bank can be notified to be on the lookout for it. For most other types of transactions, however, this concept does not apply. For example, if a sales invoice number is missing, what do you do when a customer bearing that number shows up to pay his bill? Accuse them of fraud?
Neither can we accuse employees of fraud attempt if a key number is missing. An employee who wishes to defraud the company has a plethora of ways to do it. On a sales invoice, for example, he/she could understate quantities, or leave off line items, or omit some merchandise, or better yet, skip the invoice creation entirely - which, incidentally, is the best way because it leaves no paper trail.
Meanwhile, employees need to be able to correct errors. We live in an imperfect world, and goof-ups, screw-ups, mistakes, and unanticipated changes are going to happen. Requiring consecutive transaction numbers means that the system must be loaded down with excessively complex routines to force these numbers into the required pattern. KISS just went out the window. Plus, it requires that empty or "voided" transactions to be added to the database, which are just clutter than contain absolutely no useable data.
Fraud prevention is indeed a serious issue, but it is beyond the scope of this document. There are many books and articles on it.
How many indexes?
How many indexes should a table have? The short answer is: as few as possible.
The long answer is: it depends on how the table will be used. A table that is read-only can take more indexes than one that is continually written to.
For your typical database table that probably gets a fair amount of inserts and updates, a good rule of thumb is: two indexes, maybe three. (But just make one if that's all you need.)
One way to reduce index proliferation is to design compound indexes effectively. For example, if you have one compound index on field #1 + field #2, and another compound index on field #1 + field #2 + field #3, note that the first of these indexes (the 2-field index) is redundant and unnecessary, because its functionality is already covered by the 3-field index.
Conclusion
Hopefully, you have picked up some good programming ideas from the concepts and recommendations above. Some of these concepts are pretty standard stuff, others are debatable. But all of them have as their goal to help produce software that is reliable, maintainable, flexible, user-friendly, and efficient.

Comments & Trackbacks
No Comments/Trackbacks for this post yet...
This post has 4 feedbacks awaiting moderation...
Leave a comment