Most efforts to improve programmer productivity and software quality fail to generate lasting gains. New languages, new project management and the rest are decades-long disappointments – not that anyone admits failure, of course.
The general approach of software abstraction, i.e., moving program definition from imperative code to declarative metadata, has decades of success to prove its viability. It’s a peculiar fact of software history and Computer Science that the approach is not mainstream. So much the more competitive advantage for hungry teams that want to fight the entrenched software armies and win!
The first step – and it’s a big one! – on the journey to building better software more quickly is to migrate application functionality from lines of code to attributes in central schema (data) definitions.
Data Definitions and Schemas
Every software language has two kinds of statements: statements that define and name data, and statements that act on that data – getting, processing and storing it. Definitions are like a map of what exists. Action statements are like sets of directions for going between places on the map. The map/directions metaphor is key here.
In practice, programmers tend to first create the data definitions and then proceed to spend the vast majority of their time and effort creating and evolving the action statements. If you look at most programs, the vast majority of the lines are “action” lines.
The action lines are endlessly complex, needing books to describe all the kinds of statements, the grammar, the available libraries and frameworks, etc. The data definitions are extremely simple. They first and foremost name a piece of data, and then (usually) give its type, chosen from a small selection of things like integer, character, and floating point (a number with a fractional part). There are often some grouping and array options that let you put data items into a block (like an address with street, town and state) and into sets (like an array with one entry per day of the year).
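To make the contrast concrete, here is a minimal sketch (in Python, with hypothetical names) of pure "definition" statements: each field is named and typed, and the grouping and array options collect related items:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Address:
    # a grouping: street, town and state collected into one block
    street: str
    town: str
    state: str

@dataclass
class YearOfReadings:
    # an array option: one entry per day of the year
    daily: List[float]

addr = Address(street="1 Main St", town="Springfield", state="IL")
print(addr.town)
```

Note there are no "directions" here at all – just the map.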
One of the peculiar elements of software language evolution is whether the data used in a program is defined in a single place or multiple places. You would think – correctly! – that the sensible choice is a single definition. That was the case for the early batch-oriented languages like COBOL, which has a shared copybook library of data definitions. A single definition was a key aspect of the 4-GL languages that fueled their high productivity.
Then the DBMS became a standard part of the software toolkit; each DBMS has its own set of data definitions, called a “schema.” A schema gives each piece of data a name and a data type and places it in a grouping (a table). That’s pretty much it! Then software began to be developed in layers – UI, server and database – each with its own data/schema definitions and language. Next came services and distributed applications, each with its own data definitions and often written in different languages. Each of these pieces needs to “talk” with the others, passing and getting back data, with further definitions for the interfaces.
The result of all this was an explosion of data definitions, with what amounts to the same data being defined multiple times in multiple languages and locations in a program.
In terms of maps and directions, this is very much like having many different collections of directions, each of which has exactly and only the parts of the map those directions traverse. Insane!
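Here is what that explosion looks like in miniature – a hypothetical layered app, compressed into one Python file, with the same "zip_code" field defined three separate times in three different notations:

```python
# Database layer: zip_code defined in SQL DDL.
DB_SCHEMA = "CREATE TABLE customer (zip_code CHAR(5))"

# Server layer: zip_code defined again, in application code.
class CustomerRecord:
    def __init__(self, zip_code: str):
        self.zip_code = zip_code

# UI layer: zip_code defined a third time, in form metadata.
UI_FORM = {"zip_code": {"maxlength": 5, "label": "ZIP"}}

# Lengthening zip_code now means finding and editing all three,
# plus any interface definitions between the layers.
```

Three definitions, one piece of data – and this is a tiny example.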
The BIG First Step towards Productivity and Quality
The first big step towards sanity, with the nice side effect of productivity and quality, is to centralize all of a program’s data definitions in a single place. Eliminate the redundancy!
Yes, it may take a bit of work. The central schema would be stored in a multi-part file in a standardized format, with selectors and generators for each program that shares the schema. Each sub-program (like a UI or service) would generally use only some of the program’s data, and would name the part it uses in a header. A translator/generator would then grab the relevant subset of definitions and generate them in the format required for the language of the program – generally not a hard task, and one that in the future should be provided as a widely available toolset.
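A toy version of that selector/generator pair might look like this (the schema format, field names and output formats are all illustrative assumptions, not a real toolset):

```python
# Central schema: every field defined exactly once.
CENTRAL_SCHEMA = {
    "zip_code": {"type": "str", "max_length": 5},
    "street":   {"type": "str", "max_length": 80},
    "balance":  {"type": "float"},
}

def select(schema, wanted):
    """The selector: grab just the subset a sub-program names in its header."""
    return {name: schema[name] for name in wanted}

def generate_python(subset):
    """One generator: emit the subset as Python annotations."""
    lines = ["class Record:"]
    for name, spec in subset.items():
        lines.append(f"    {name}: {spec['type']}")
    return "\n".join(lines)

def generate_sql(subset):
    """Another generator: emit the same subset as SQL columns."""
    cols = []
    for name, spec in subset.items():
        if spec["type"] == "str":
            cols.append(f"{name} VARCHAR({spec['max_length']})")
        else:
            cols.append(f"{name} REAL")
    return "CREATE TABLE record (" + ", ".join(cols) + ")"

# A UI sub-program declares that it uses only two of the fields.
ui_subset = select(CENTRAL_SCHEMA, ["zip_code", "street"])
print(generate_python(ui_subset))
print(generate_sql(ui_subset))
```

Each target language gets a generator, but the definition itself lives in exactly one place.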
Why bother? Make your change in ONE place, and with no further work it’s deployed in ALL relevant places. Quality (no errors, no missing a place to change) and productivity (less work). You just have to bend your head around the "radical" thought that data can be defined outside of a program.
If you're scratching your head and thinking that this approach doesn't fit into the object-oriented paradigm in which data definitions are an integral part of the code that works with them, i.e. a Class, you're right. Only by breaking this death-grip can we eliminate the horrible cancer of redundant data definitions that make bodies of O-O code so hard to write and change. That is the single biggest reason why O-O is bad -- but there are more!
The BIG Next Step towards Productivity and Quality
Depending on your situation, this can be your first step.
Data definitions, as you may know, are pretty sparse. But there is a huge amount of information we know about data that we normally express in various languages, often in many places. When we put a field on a screen, we may:
- Set permissions to make it hidden, read-only or editable.
- Mark it required or optional, if it can be entered at all.
- Display a label for it.
- Control its size and format, to handle things like selecting from a list of choices or entering a date.
- Check the input to make sure it’s valid, and display an error message if it isn’t.
- Group it with related fields for display and give the group its own label, like an address.
Here's the core move: each one of the above bullet items – and more! – should be defined as attributes of the data/schema definition. In other words, these things shouldn't be arguments of functions or otherwise part of procedural code. They should be attributes of the data definition, just as Type already is.
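As a sketch of what that looks like (attribute names like "permission" and "valid" are my own invented conventions, not a standard), here is a field whose label, permissions, requiredness and validation all live in the definition, driven by one generic routine:

```python
# Every UI behavior expressed as a declarative attribute of the field
# definition itself, not as arguments scattered through procedural code.
SCHEMA = {
    "email": {
        "type": "str",
        "label": "Email address",
        "permission": "editable",     # or "read-only", "hidden"
        "required": True,
        "max_length": 120,
        "valid": lambda v: "@" in v,
        "error": "Please enter a valid email address.",
    },
}

def validate(schema, field, value):
    """A generic check driven entirely by the field's attributes."""
    spec = schema[field]
    if spec["required"] and not value:
        return f"{spec['label']} is required."
    if len(value) > spec["max_length"] or not spec["valid"](value):
        return spec["error"]
    return None  # valid

print(validate(SCHEMA, "email", "a@b.com"))  # no error
print(validate(SCHEMA, "email", ""))         # required-field message
```

Notice that `validate` knows nothing about email specifically; add a field to the schema and it is checked with zero new procedural code.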
This is just in the UI layer. Why not take what’s defined there and apply it as required at the server and database layers – surely you want the same error checking there as well, right?
Another GIANT step forward
Now we get to some fun stuff. You know all that rhetoric about “inheritance” you hear about in the object-oriented world? The stuff that sounds good but never much pans out? In schemas and data definitions, inheritance is simple and … it’s effective! It’s been implemented for a long time in the DBMS concept of domains, but it makes sense to greatly extend it and make it multi-level and multi-parent.
You’ve gone to the trouble of defining the multi-field address group. There may be variations that have lots in common, like billing and shipping addresses. Why define each kind of address from scratch? Why not define the common parts once and then say only what’s unique about shipping and billing?
Once you’re in the world of inheritance, you start getting some killer quality and productivity. Suppose it’s decades ago and the USPS has decided to add another 4 digits to the zip code. Bummer. If you’re in the enhanced schema world, you just go into the master definition, make the change, and voila! Every use of zip code is now updated.
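Here is that zip code story as a runnable sketch (the domain/resolve machinery is a hypothetical miniature of what a real schema toolset would provide): a "zip" domain defined once, an address group built on it, two variants that add only what differs, and a single change that propagates everywhere:

```python
# A domain: the definition of "zip", made exactly once.
DOMAINS = {"zip": {"type": "str", "length": 5}}

# An address group whose zip field inherits from the domain.
ADDRESS = {
    "street": {"type": "str"},
    "town":   {"type": "str"},
    "zip":    {"domain": "zip"},
}

# Variants inherit the common parts and add only their differences.
SHIPPING = {**ADDRESS, "delivery_notes": {"type": "str"}}
BILLING  = {**ADDRESS, "po_number":      {"type": "str"}}

def resolve(field):
    """Expand a field through its domain to the concrete definition."""
    return DOMAINS[field["domain"]] if "domain" in field else field

# The USPS adds four digits: one change, in the domain...
DOMAINS["zip"]["length"] = 9

# ...and every address variant, everywhere, sees it.
print(resolve(SHIPPING["zip"])["length"])
print(resolve(BILLING["zip"])["length"])
```

The same mechanism extends naturally to multi-level and multi-parent inheritance: a domain can itself be built on other domains.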
Schema updating with databases
Every step you take down the road of centralized schema takes some work but delivers serious benefits. So let’s turn to database schema updates.
Everyone who works with a database knows that updating the database schema is a process. Generally you try to make updates backwards compatible. It’s nearly always the case that the database schema change has to be applied to the test version of the database first. Then you update the programs that depend on the new or changed schema elements and test with the database. When it’s OK, you do the same to the production system, updating the production database first before releasing the code that uses it.
Having a centralized schema that encompasses all programs and databases doesn’t change this process, but it makes it easier – fewer steps with fewer mistakes. First you make the change in the centralized schema. Then it’s a matter of generating the data definitions, first for the test systems (database and programs) and then for the production system. You may have made just a couple of changes to the centralized schema, but because of inheritance and all the data definitions that are generated, you might end up with dozens of changes across your overall system – UI pages, back-end services, API calls and definitions, and the database schema. Done by hand, an omission or mistake in just one of those dozens of changes means a bug that has to be found and fixed; generated from the central schema, they all come out consistent.
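One piece of that generation step can be sketched as a schema diff: compare two versions of the (hypothetical, much-simplified) central schema and emit the matching database migration, so the database change comes from the same source as everything else. The ALTER syntax here is PostgreSQL-flavored and illustrative only:

```python
# Two versions of the central schema's database portion.
OLD = {"customer": {"zip": "CHAR(5)", "name": "VARCHAR(80)"}}
NEW = {"customer": {"zip": "CHAR(9)", "name": "VARCHAR(80)",
                    "email": "VARCHAR(120)"}}

def migration(old, new):
    """Diff the two versions and emit the migration statements."""
    stmts = []
    for table, cols in new.items():
        for col, coltype in cols.items():
            if col not in old[table]:
                stmts.append(f"ALTER TABLE {table} ADD COLUMN {col} {coltype}")
            elif old[table][col] != coltype:
                stmts.append(f"ALTER TABLE {table} ALTER COLUMN {col} TYPE {coltype}")
    return stmts

for s in migration(OLD, NEW):
    print(s)
```

A real toolset would also handle dropped columns, renames and backwards-compatibility staging, but the principle is the same: the migration is derived, not hand-written.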
Conclusion
I’ve only scratched the surface of a huge subject in this post. But in practice, it’s a hill you can climb. Each step yields benefits, and successive steps deliver increasingly large results in terms of productivity and quality. The overall picture should be clear: you are taking a wide variety of data definitions expressed in code in different languages and parts of a system and step by step, collapsing them into a small number of declarative, meta-data attributes of a centralized schema. A simple generator (compile-time or run-time) can turn the centralized information into what’s needed to make the system work.
In doing this, you have removed a great deal of redundancy from your system. You’ve made it easier to change. Non-redundancy is rarely looked on as a key thing to strive for; but since the vast majority of what we do to software is change it, non-redundancy may be the most important measure of goodness that software can have.
What I've described here are just the first steps up the mountain. Near the mountain's top, most of a program's functionality is defined by metadata!
FWIW, the concept I'm explaining here is an OLD one. It's been around and been implemented to varying extents in many successful production systems. It's the core of climbing the tree of abstraction. When and to the extent it's been implemented, the productivity and quality gains have in fact been achieved. Ever hear of the Rails framework in Ruby, built around the DRY (Don't Repeat Yourself) principle? That's a limited version of the same idea. Apple's credit card runs on a system built on these principles today. This approach is practical and proven. But it's orthogonal to the ideas about software that are generally taught in Computer Science and practiced in mainstream organizations.
This means that it's a super-power that software ninjas can use to program circles around the lumbering armies of mainstream software development organizations.