29.9.2006
Programming (writing source code) is about communication. It is not just
laying down instructions which a computer then executes; writing source code
is also (and, as I would argue, primarily) to explain, at the same time, what
happens when the code is executed.
The audience of that explanation includes maintainers, other programmers
on the team, and, last but not least, your own future self. In short,
everybody who has to deal at a later time with your source code is someone
whom you talk to through the way that you write your code.
This is why there are such things as coding style and formatting
guidelines. Most people who are in charge of software projects (of whatever
sort, for example open source projects, or commercial software product
development) know from experience how hard it is to understand a
substantial code base a few months or even years after it has been
originally written; the task gets all the more difficult if the codebase is
heterogeneous with respect to coding style, idiosyncratic formatting and
other such things. The project leaders therefore ask developers to follow
some rules which render the code base more uniform and thus easier to
comprehend. (This does not mean that these rules have to be forced on
the developers 'from above', of course; more often, a team of
engaged developers will develop its own rules from a
consensus.)
The fact that writing source code is about communication with other
developers is also one of the reasons why side effects are bad. If you
write a program sequence that exploits side effects, this means that your
code actually does something different than what you
communicate it does. Reading or debugging such code can be incredibly
hard if done some time later, or by a different person (or both). And the
reason for this is not that such code is more difficult to grasp (in the
sense in which a tricky recursive algorithm may be more difficult to grasp
than, say, a simple loop). It is because the original programmer made a
misleading statement, through his source code, as to what the code was
supposed to do; and the programmer who later had to debug that code relied
on him and was thus misled.
Take an example: A while ago I was in a project that had some very old
database access code in some peripheral module which suddenly stopped
working. Nobody had changed that code (or the database), and no errors
occurred. It simply gave empty result lists all the time. We had long
debugging sessions and couldn't find anything wrong in the code. It worked
quite well whenever we executed passages of it separately - only in context,
integrated in the system, it always returned empty results. Then we
double-checked that nobody had changed that code. We examined the entire
revision history in the source code control system. Nothing had been
changed - nothing, that is, but a few logging statements which had been
removed. We weren't suspicious of these logging statements at first, but
then one of the team members noticed: one of the removed logging statements
had printed the return value of a function call, and that call was - the
bit of initialization code which made the database connection and
retrieved all the data!
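The situation can be reconstructed in a few lines of Python. This is only a sketch: the class `Repository` and the functions `init_connection`, `fetch_with_log` and `fetch_after_cleanup` are names invented for illustration, not code taken from the project described above.

```python
class Repository:
    """Hypothetical stand-in for the old database access module."""

    def __init__(self):
        self.rows = []  # stays empty until init_connection() has run

    def init_connection(self):
        # Side effect: "connects" and loads the data. Returns a status code.
        self.rows = ["alice", "bob"]
        return 0


def fetch_with_log(repo, log):
    # Looks like pure diagnostics, but this is the only call that initializes.
    log(f"init returned {repo.init_connection()}")
    return repo.rows


def fetch_after_cleanup(repo):
    # The "harmless" log line was removed; nothing loads the data any more.
    return repo.rows
```

With the log line in place the function returns the loaded rows; once the line is deleted, the same function quietly returns an empty list and no error is raised anywhere.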
Now, this is a somewhat crass example. Most of the time the code that
exploits side effects is a bit more subtle - but that is only a difference
of degree. What matters is that the programmer who originally wrote this
neglected the communication aspect of coding. Very probably he first
wrote that statement which caused the initialization, and at a later time
(presumably during some debugging), he added a return code and wrapped the
log statement around it. This makes sense only as long as one doesn't
consider how someone else would read the code. One simply focuses on the
task at hand, and for that it makes no difference where the initialization
statement is called. But the very moment that someone else (or the same
programmer a couple of months later) has to understand that code, the log
statement is quickly skipped when reading the code - because log statements
are, by convention, something that doesn't add functionality. (Many systems
are built so that logging may be disabled entirely when production mode is
entered.)
One might argue that the root problem in this case was not that this
developer neglected such a fundamental of programming; he merely
violated a basic rule that applies to log statements: they shouldn't be
written in a way that they change the program state or the content of data
structures; they should be 'read-only'.
True, but that's exactly the point: log statements should not trigger
side effects on which the main control flow depends. And that is precisely
because nobody expects log statements to do that. It is a convention
that helps to read source code (because it enables the reader to skip
inessential lines of code). It is a convention, that is, that facilitates
communication between developers (or between earlier and later selves of
the same developer) about the program code in question. It is not at all
necessary for the compiler or the runtime system to execute the code.
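A minimal sketch of that convention in Python, with hypothetical names (`Repository`, `init_connection`, `fetch`): the initialization stands on its own line, and the log statement only reads a value that already exists, so deleting or disabling it cannot change the result.

```python
class Repository:
    """Hypothetical data source; the names are assumptions for illustration."""

    def __init__(self):
        self.rows = []

    def init_connection(self):
        # Side effect: loads the data. Returns a status code.
        self.rows = ["alice", "bob"]
        return 0


def fetch(repo, log=None):
    # The essential step is a statement of its own, visible to every reader.
    status = repo.init_connection()
    # The log statement is 'read-only': it reports status and changes nothing.
    if log is not None:
        log(f"init returned {status}")
    return repo.rows
```

Called with or without a `log` argument, `fetch` returns the same rows, which is exactly what a reader who skips the log line will assume.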
In one of my earlier
posts, I wrote about indirect programming, and what makes it so bad.
The discussion above shows an additional aspect of indirect programming:
it is the analogue, on a software design level, to what side effects are
on the more basic coding level.
Passages of a program that have been generated by indirect programming
are much more difficult to understand because what they actually do is not
what they seem to do. (Often enough, they misleadingly seem to do nothing
sensible at all - in that respect they are very similar to the logging
statement in the example above.)