When Bad Things Happen to Good Computers
Why doesn't Apple just FIX THIS THING!!!
(or "When Bad Things Happen to Good Computers.")
by Tracy Valleau
One of the most frustrating things for computer
users is the installation of a new piece of software... that
doesn't work, or worse yet crashes the computer.
The swear words are often enough to embarrass
the most savvy sailor.
These are usually followed by the posting
of letters maligning the software, the publisher, the programmers,
the computer manufacturer; and any other nearby and convenient
target.
"Why doesn't (name) just fix this?"
is the usual plaintive whine.
Well, here's why: because "just"
doesn't apply.
There are only a few things that can
cause software to misbehave: 1) hardware failure; 2) programmer's
error; 3) software conflicts; 4) input or user error.
Hey: if your hardware is broken, well
- fix it. There's nothing the software is going to do to repair
your broken hard drive.
However, in this category, there is one
insidious hardware/software problem that must be mentioned: the
file tracking done on your hard drive with drivers.
All your software is stored on your hard
drive, just as a collection of short stories is stored in a book.
To find a given story, you consult the
table of contents, and it directs you to the proper location
(page) in the book. You start reading, and when you finish one
page, you assume that turning the page will take you to the next
words in the story.
But suppose that there was a printing
error, and you're 12 pages into "War and Peace" and
turn to page 13, only to discover that page 13 contains the 33rd
page of "Skippy the Bunny!"
In essence, this is what happens when
the computer hard drive's directory gets corrupted.
In fact, it's much more complex than
that, since, instead of each page following one on the other,
a computer has to return to the table of contents (so to speak)
to find out where the next page is, just as if you ripped out
all the pages of "War and Peace," threw them up in
the air and then gathered them back together in totally random
order. (This happens on a drive because you save and delete things
of different sizes, leaving bits of a file scattered in different
places.)
If that "trail" of page after
page gets even one page out of order, then the program crashes
because it loaded in the wrong thing, and when it goes to work
on it or to use it, what it expects to be there is something
entirely different.
If you put on a parachute, boarded a
seaplane and jumped out, you'd be in deep trouble if you did
so before it took off, because instead of the air you expected
to be falling through, you'd be in over your head in water.
Same thing with a computer program: if
it expects to find one thing, but finds another because the directory
was wrong, and it loaded in something else - you get "the
big thud."
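For the curious, the "table of contents" idea can be sketched in a few lines of Python. This is only a toy model, not how any real file system is laid out: the block numbers, the `start_block` and `next_block` tables, and the `read_file` helper are all made up for illustration.

```python
# Hypothetical toy "drive": block number -> contents.
blocks = {
    0: "War and Peace, page 1",
    1: "Skippy the Bunny, page 1",
    2: "War and Peace, page 3",
    3: "War and Peace, page 2",
    4: "Skippy the Bunny, page 2",
}

# Toy directory: where each file starts, and which block follows which.
# None marks the end of a file.
start_block = {"War and Peace": 0, "Skippy the Bunny": 1}
next_block = {0: 3, 3: 2, 2: None, 1: 4, 4: None}


def read_file(name, start_block, next_block, blocks):
    """Follow the chain of blocks for one file and return its pages in order."""
    pages = []
    seen = set()
    block = start_block[name]
    while block is not None:
        if block in seen:  # a corrupted chain can loop back on itself
            raise RuntimeError("directory chain loops back on itself")
        seen.add(block)
        pages.append(blocks[block])
        block = next_block[block]
    return pages


# With a healthy directory, the scattered pages come back in story order.
good = read_file("War and Peace", start_block, next_block, blocks)

# Corrupt a single link: block 3 now points into the other file.
corrupted = dict(next_block)
corrupted[3] = 4
bad = read_file("War and Peace", start_block, corrupted, blocks)
```

In this sketch `good` holds pages 1, 2 and 3 of "War and Peace" in order, even though they sit in blocks 0, 3 and 2. After one bad link, `bad` ends with a page of "Skippy the Bunny," and nothing in the reading code notices.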
How do you prevent directory corruption? Well,
if you crash, or notice anything suspicious, take the time to
run a directory repair program. The longer a corrupted directory
is used, the more corrupted it becomes.
Other problems related to hardware, including
loose cables, failing power supplies, bad RAM, drives on their
last legs and so on, can all cause problems that show up in your
software. (Power supplies and hard drives both have a working life
of about 5 years, give or take.)
Programmer's errors not only do occur,
but it is an axiom in the trade that there is no such thing as
bug-free software. Why? Simply because you'd have to check
every single line of code against every single possible contingency
to cover all your bases. That would produce code so bloated that
no machine could run it; so expensive that no publisher could
afford it; and so impossible on the face of it, that it could
never be done (since it would require that once the software
was written, all change in the universe come to a complete halt,
so that no different contingencies would ever arise.)
So, programs are written to cover all
the most common problems and foreseeable errors. But even this
has to be weighed in the light of reality. One recent iteration
of a popular operating system was released with 22,000 known
bugs! (That doesn't include the probable doubling of that number
in unknown bugs once a few million users started mucking about
with it on a few million different machines, either!)
Hey! We're all human, and programmer
errors do happen, although languages such as C++, and the use
of libraries of prebuilt and pre-debugged code have gone a long
way toward cutting that source of errors down.
There are reasonably sophisticated automated
debugging tools which encapsulate each line of code, and can
track down such mysterious things as "dangling pointers",
and "undisposed handles" (ways that programs access
memory.)
Good code will "TRY" a process,
and if it doesn't work, "CATCH" the error. Problems
caused by programmer error are now WAY down from the early years
of personal computers.
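Here is a minimal sketch of that "TRY" and "CATCH" idea in Python, where the keywords are spelled `try` and `except`. The function name, the settings layout and the 29.97 default are all invented for the example.

```python
def load_project_frame_rate(settings):
    """Return the frame rate from a settings dict, falling back to a default.

    The settings layout and the 29.97 default are made up for this sketch.
    """
    try:
        # TRY the risky step: the key may be missing or the value unreadable.
        return float(settings["frame_rate"])
    except (KeyError, ValueError, TypeError):
        # CATCH the error and recover instead of crashing.
        return 29.97
```

Called with `{"frame_rate": "25"}` this returns 25.0; called with an empty dict, or with a value like `"fast"`, it quietly falls back to the default rather than taking the whole program down.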
So, software problems come down to true
errors (the programmer used an "add" command when he
should have used "multiply," for example, which no software
debugging tool can detect) and "bullet-proofing":
trying to keep the unforeseen from totally messing everything
up.
This last is the most common problem,
exacerbated by the fact that there are no two computers alike,
except as they come off the assembly line. We add software, create
documents; hang new hardware off it; add memory; update the operating
system; customize the desktop; and so on. Honestly, the fact
that most software runs as well as it does on 99.9% of the machines
in the world is nothing short of amazing. (Which is of little
consolation when it's you that is the remaining 0.1%...)
We're about to see what that means when
it is you that's affected.
Software "conflicts" are by
far the greatest single source of problems. In computer speak,
a process (much like in normal speech) is just something that
is going on. Running software starts a process; any software.
Merely starting your machine in the morning starts hundreds of
individual processes - some running the monitor; the keyboard;
the drives; managing the memory and so on.
A conflict is a case where some combination
of processes running on a given machine alters the state of that
machine to the point where one (or more) processes can no longer
function properly.
Let me give you an example. Process ONE
wants to remember something, so it stores it in memory location
"A". Process TWO wants to remember something, and also
stores it in memory location "A". Process ONE now decides
it needs what it just stored, and so retrieves it from memory
location "A"... but now it's not what ONE stored there,
but what TWO stored there.
ONE then tries to use that retrieved,
and incorrect information, and, like slipping on the parachute,
instead finds itself under water. Crash.
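The ONE and TWO story can be acted out in a few lines of Python. It is a deliberately crude sketch: a dictionary stands in for the machine's memory, and the three function names are hypothetical.

```python
memory = {}  # a toy stand-in for the machine's memory


def process_one_store():
    memory["A"] = [1, 2, 3]  # ONE saves a list of numbers in location "A"


def process_two_store():
    memory["A"] = "hello"  # TWO overwrites the same location with text


def process_one_use():
    # ONE assumes "A" still holds its own list of numbers.
    return sum(memory["A"])


# While ONE is the only user of "A", everything works.
process_one_store()
alone = process_one_use()

# Now TWO stores something in "A" between ONE's store and ONE's use.
process_one_store()
process_two_store()
try:
    process_one_use()
    outcome = "ok"
except TypeError:
    outcome = "crash"  # sum() cannot add up the characters of a string
```

Neither function is wrong on its own. ONE only fails because of what TWO left behind, which is exactly why the crash shows up in one program while the cause sits in another.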
Now, that is a crude and, with modern
systems and coding, unlikely example... but it's a perfectly
valid one.
From the user's standpoint, his/her machine
ran "just fine" until s/he installed the software that
is Process TWO. Then the machine started crashing.
Must be the new software's fault, right?
Er... why? Maybe process ONE should not have
been using Memory location "A". Maybe process ONE is
the only software in the world that uses memory location "A"
except for process TWO.
You might report the problem to the publisher
of TWO, who might spend days, weeks, months trying to find the
problem, when, in fact, there is absolutely nothing wrong with
TWO.
Now this example is one of the easier
kinds of problems to find, if you're willing to work with the
publisher to help him find it. You can simply try removing everything
and seeing if TWO continues to crash. (It won't, as long as ONE
isn't running.) When you finally run ONE, and the crash in TWO
happens, you'll have started toward finding the solution.
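That remove-everything-then-add-back routine is simple enough to write down as a procedure. The sketch below assumes a hypothetical `crashes_with` check; in real life that check is you, restarting the machine and running the program.

```python
def find_conflict(suspects, crashes_with):
    """Add suspects back one at a time; return the one that triggers the crash.

    `crashes_with` is a hypothetical callback: given the list of items
    currently running alongside the problem program, it returns True if
    the program crashes.
    """
    running = []  # start from a clean machine with nothing extra loaded
    for item in suspects:
        running.append(item)  # add one more item back
        if crashes_with(running):
            return item  # the crash first appeared with this addition
    return None  # nothing on the list reproduces the crash


# Example: the crash in TWO only happens when ONE is also running.
installed = ["fonts utility", "ONE", "screen saver"]
culprit = find_conflict(installed, lambda running: "ONE" in running)
```

Because earlier items stay loaded as new ones are added, this also catches the harder case where the crash needs a combination of items: it reports the item that completes the combination.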
Where it gets hard is if ONE only uses
Memory location "A" because some other process has
made a change in the machine, requiring "A" to be used
instead of ONE's preferred "B".
Yes - that was supposed to confuse you.
If you're confused, and you're a thinking, adaptable human being,
you'll begin to see how a blind, dumb machine following a fixed
set of rules is prone to errors.
There is no "almost perfect"
in computers; no "nearly right." If things are not
perfectly set up, then bad things happen to good machines.
That is why the first thing a tech-support
person will ask you to do is to restart the machine in a known
good state, and then run the software to see if the problem still
exists. Starting to diagnose what is going on from any other
place is just plain impossible. (At least that's what happens
with most Macintosh diagnosis; with Windows machines, the problem
is multiplied several fold because of the intricacies of INI
and a few dozen other setup files that are modified each time
new software is installed.)
So, on the one hand, there are the possibilities
of an infinite variety of software, and on the other, an infinite
variety of machine configurations. And all it takes is one of
those to not be perfect under all circumstances.
Finally, you may have triggered the error.
Loading up a project incorrectly; telling the software that you're
using 48K sound when you're using 32K, and then saving the project;
switching from 32K to 48K in the middle of a tape; doing a forced
break out of a program when it's writing to disk; restarting
a program after a crash, without restarting the machine first;
and so on.
And then there's the obvious one,
that applies, naturally enough, to everyone in the world except
you: read the instructions! Your expectations of what the program
should do may not be what it's designed to do. That is not,
my friends, a bug.
The truth be known, of all the tech support
calls and bug reports a company gets, perhaps one in every 800-1000
is actually a problem with the software. Most often, it's "operator
error" followed, at some distance, by software conflicts.
Real life -
However, some bugs are real. Here's an anecdote
about finding one with Final Cut Pro 2.0.
The 2-pop board contained a couple of
references to crashes when logging material. The usual suspects
didn't seem to fix the problem. I noticed the problem myself,
and had logged on to 2-pop to see if I was alone, and discovered
that I was not.
So, as an experienced debugger, I started
off tracking it myself.
First, I restarted with only the recommended
extensions: no luck. (But that's where you should start, too!)
I adjusted the memory: no luck. I trashed the preferences: no
luck. I reinstalled the software: no luck. I looked at 2-pop:
no luck.
I was intrigued. I next thought about
what was obviously different between 1.25, which worked just
fine, and 2.0, which crashed. The most obvious thing was in my
face: the Audio Metering window (AMW).
So I closed it... and the problem
went away.
Over the next several hours, I tried
that new-found technique under various system configurations,
and in every single case, closing the AMW fixed the problem,
and in every single case opening it caused the problem. At this
point I was pretty confident that I could post the technique
to the board, and did.
As a developer, I also posted the issue
to Apple's DTS (Developer Technical Support) bug board.
Within 90 minutes I had received a call
from Apple.
Folks, this was to be expected from a
quality software company - any such company, not just Apple,
because we had a reproducible problem, shared by others, with
a symptomatic relief that always worked.
Real problems, reproducible problems,
are something that programmers and publishers want to catch and
fix: it's only good business. It makes them money. No quality
publisher is going to turn down the chance to improve his product,
especially if it's malfunctioning! Really.
Compare that reproducibility with a tech
support call like this: "You sleazy &**$!!!'s! Your
software is worth %$#! I'm never going to buy your @#$! software
again!"
"What seems to be the problem?"
"It crashes!"
er... um...
The lesson here is this: most people
DON'T have your problem with the software; it's a problem likely
unique to your machine. So, if you want tech support to help
you, don't count on them being mind-reading magicians.
Remain calm. REPRODUCE the problem and
write down the steps. Then, as a team, you and support can find
the solution.
But I digress...
Apple, as I said, was very concerned
about the problem. We spoke on the phone several times and exchanged
dozens of emails. The folks at Apple were there until well after
closing on a Friday. We tried several different things. But
no matter how much we tried, it always crashed for me and it
never crashed for them.
When Monday morning rolled around, they
were right back on the case. Between us, we tried reinstalling
the OS; installing as yet unreleased "OSen." I installed
MacsBug and printed out stdlogs, showing the internals of the
problem. I hooked up different hardware.
All to no avail. Then, at Apple's suggestion,
we tried pulling the RAM, since I had 1.5 gigs installed. The
suggestion was that the RAM, or the new firmware upgrade, might
have caused the problem. (That is, it was beginning to look like
a hardware problem, since nothing I did, from clean system installs
on, seemed to make a whit of difference.)
With the hardware looming larger as a
possibility, I contacted the other person (Mark) who had the
crash, and we exchanged hardware setup information.
There was almost nothing in common: his
was a nice clean setup, while mine was a cluttered mess of things
- yet we both crashed. But I did notice one thing, and it led
right back to Apple's RAM suggestion: we both had 1.5 gigs of
RAM.
Mark tried pulling some of his RAM. He
also reset the memory allocations so that FCP would use most
of what was available as he did the RAM changes.
And the problem went away.
So: was it hardware?
The clue came when Mark reported back
that he had reset his RAM allocation (minimum and preferred sizes)
to 500000K and 800000K. Mine were set considerably smaller.
So I reset my sizes to match his... and
the problem went away for me as well.
So, there we were. The problem was related
to hardware, but was, in fact, a software problem, caused (most
likely) by an assumption made somewhere along the line in the
programming that an address in memory was in one memory bank
(one DIMM, as specified by the upper bytes of the address itself)
when, in certain specific circumstances, it was in another. By
resetting the memory allocations, we forced the memory to be
in the same bank, and the addressing error went away.
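To make that guess a little more concrete, here is a toy sketch of the kind of assumption that could go wrong. It is not Apple's code and not how Final Cut Pro works. The 512 MB bank size is an assumption (1.5 gigs split three ways), and `bank_of` and `spans_banks` are hypothetical helpers.

```python
BANK_SIZE = 512 * 1024 * 1024  # assumed: three 512 MB banks making up 1.5 GB


def bank_of(address):
    """Return which bank an address falls in, judged by its upper bits."""
    return address // BANK_SIZE


def spans_banks(start, length):
    """True if a block of memory starts in one bank and ends in another."""
    return bank_of(start) != bank_of(start + length - 1)


# A block that sits well inside the first bank stays in one bank.
inside = spans_banks(0, 4096)

# A block placed just below a bank boundary crosses into the next bank.
near_boundary = BANK_SIZE - 1024
crossing = spans_banks(near_boundary, 4096)
```

Code that quietly assumes `spans_banks` is always false would run fine on most machines and fail only when a block happened to land across a boundary. Changing the memory allocation changes where blocks land, which would fit what Mark and I saw.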
It took the concerted effort of many
people, at about 72 hours each, to find the problem and provide
enough information so that Apple could reproduce the problem
and eventually fix it.
What's to be learned from this anecdote?
For one thing - Apple IS listening and cares about the product.
They responded almost instantly, and stayed with it until the
problem was identified and fixed. But, without cooperation and
help from the users (Mark and me), the problem could have lingered
for weeks, or forever. Patience and perseverance prevailed.
So, when bad things happen to good computers,
here's what you can do:
Remain calm (not always easy to do, I know).
Try to determine if it's just something
on your machine:
a) is your software / project / document
set up correctly? Was everything just fine until your last effect
was added? If so, did the project file get corrupted somehow?
(Now you can see why backups are important, no?) Is the media
corrupted?
b) Did you recently add some new software?
Run an installer of any kind? Change your setup? Crash recently
(even when using some other program) possibly corrupting files?
c) Did you just defragment your drive? (This
is notorious for corrupting files - before you defragment a drive,
back it up!)
Mac users: try trashing the preferences
for your software; try rebuilding the desktop; run Disk First
Aid, or DiskWarrior or TechTool Pro. Restart with only the extensions
you need for the software - if that works, you've got an extension
conflict.
PC users: try running Norton to verify
the directory structure. Other than that, I'd have to say: call
your IT guy, 'cause I don't know squat about how you'd find it
on a PC, outside of returning the machine to a previously known
good state, or uninstalling whatever you think might be the cause.
The point here, in both cases, is to
try to find the problem yourself first.
Why should you? Well, as we just saw,
the odds are about 199 in 200 that it's something particular
to your machine's current state, and finding it and fixing it
yourself will get you up and running faster.
If you take these preliminary steps,
then if you need to call tech support for the product, you'll
be able to tell them that you've already done those steps, since
that's exactly what they'll have you do anyway.
When and if you do call, remember that
the guy or gal on the other end of the phone didn't write the
software, and isn't personally responsible for your particular
problem.
Remember that he or she has already been
berated, yelled at, cursed, vilified and threatened 84 times
today alone. So, if your goal is to solve your problem and get
back to work as soon as possible, consider this counterpoint:
be NICE. Overwhelm them with courtesy. Politely explain the problem,
and the steps you've already taken to solve it. Have all the
information at hand (such as your OS version; amount of RAM;
serial number of the product and so on.)
After 84 jerks, you'll be a ray of sunshine
- manna from heaven - such a delightful relief that they will
work with you until hell freezes over to resolve your issue.
This works for me every time.
However, if you do get someone who has
had one too many jerks, and is being rude to you, simply hang
up and try again. If it's a big company, you'll get someone else
on the second try. If you don't, then ask for a different tech,
explaining that you're being civil, and expect the same in return.
Finally, there's this - try some preventative
maintenance.
You don't expect your car to run forever
without oil-changes, refilling the tank and tune-ups; treat your
computer the same way. Don't load on every extension; utility;
enhancement; gosh-O, golly-gee-whiz doodad that comes your way.
Rebuild the directory every so often - not just when trouble
appears. Stay up to date with your OS - use the most recent version
(because it's likely to have fewer known bugs) as well as the
most recent version of your software (for the same reason).
Mac users can run their software updater
control panel; PC users - check the MS website for the latest
DLLs and system updates.
Be aware that this maintenance is an
ongoing process, and take the time to do it. Remember that software
(such as Final Cut Pro) is developed to take advantage of the
latest changes in the OS; and the OS evolves to provide bug fixes
and new features. In short, these happen in parallel, so don't
get out of sync here. Don't expect to run the latest software
on an old OS, nor a new OS with old software. If you do, you're
asking for trouble.
This is hardly an exhaustive treatment
of what can go wrong - there are 6-year-old hard drives; bad removable
media; faulty cables; and a few zillion other things.
But with patience and logic and perseverance,
you'll be able to solve 90% of the problems yourself, and be
back up and running, even when bad things happen to good computers.
Copyright 2001 Tracy Valleau
About the author
Tracy Valleau started
programming computers in 1978, and is credited with one of the
earliest multimedia / hypertext programs (1988). He is currently
the CEO of Digital Light Studios, Inc., a multimedia and consulting
firm whose clients have included McGraw-Hill, Sony, Apple, Silicon
Graphics and others. Mr. Valleau can be reached at tracy@DigitalLightStudios.com.