If you've ever run the same web logs through different analysis tools, then you already know that no two tools will give you the same result for the same data. They all process it differently and they all get different numbers. There are a few reasons for that, and a lot of them relate to IP addresses, user agent strings, and incomplete or corrupted log lines.
Web log analysis results can vary by upwards of several hundred percent from one tool to the next. This does not inspire confidence. Some tools perform better than others of course, but the core problem boils down to the data being logged in a poor fashion.
When you open a web log file, you're presented with what to most people is incomprehensible garbage, or patterned garbage. But there is real information in there. Getting it out of the web log file in a way that preserves the reality of the situation, however, is much more difficult than it looks.
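To make the "patterned garbage" a bit more concrete, here is a minimal Python sketch that pulls the fields out of one log line. It assumes the widely used "combined" access log layout, which your server may or may not be configured to write, and the function name parse_line and the sample line are made up for illustration. Lines that don't fit the pattern (truncated or corrupted ones) come back as None rather than being guessed at.

```python
import re
from typing import Optional

# Assumed layout: the common "combined" access log format. Real logs vary.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it is malformed."""
    match = COMBINED.match(line)
    if match is None:
        return None  # incomplete or corrupted line: report it, don't guess
    return match.groupdict()

# Hypothetical sample line, not taken from any real log.
sample = (
    '203.0.113.7 - - [10/Oct/2006:13:55:36 -0700] '
    '"GET /download/setup.exe HTTP/1.1" 200 2326 '
    '"http://example.com/" "Mozilla/5.0"'
)
truncated = '203.0.113.7 - - [10/Oct/2006:13:55:36 -0700] "GET /downl'

fields = parse_line(sample)
broken = parse_line(truncated)
```

Notice that even a clean parse only tells you what one request looked like. Nothing in the line says which visitor it belongs to or which other lines go with it.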
A lot of that has to do with HTTP being stateless. But it's not an insurmountable problem. There are some good tools out there that make a very valiant effort and present some pretty good data. ClickTrax and Web Log Storming come to mind as two good tools. Analog is free, and while I'm not always a fan of free software, it does a darn good job all things considered.
Some numbers tend to be regularly underestimated by many tools, while others are overestimated. An example of overestimation is file download counts. This is a notoriously difficult problem to solve, and very few real solutions exist. Pay Per Download (PPD) businesses rely on that count being accurate, and their programs are among the few reliable solutions out there for counting file downloads. The best known PPD program is from CNET at Download.com (or Upload.com if you're an author).
But I digress. The core problem comes from data fragmentation in log files and the inability to properly associate related lines after the fact. At best you can only guess at which lines are related to each other. Once the data is logged, you've lost the relationships and are at the mercy of guesswork to reassemble things.
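Here is a rough Python sketch of the kind of guesswork I mean. It is not how any particular tool works; it is just one common heuristic: treat hits from the same IP address and user agent as one visit, as long as they are no more than 30 minutes apart. The function name guess_sessions, the 30-minute cutoff, and the sample hits are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Arbitrary cutoff (an assumption): a longer gap starts a "new visit".
SESSION_GAP = timedelta(minutes=30)

def guess_sessions(hits):
    """Group (ip, agent, timestamp) hits into guessed visits.

    Returns a list of sessions, each a list of hits in time order.
    """
    sessions = []
    last_seen = {}  # (ip, agent) -> (timestamp of last hit, its session)
    for ip, agent, ts in sorted(hits, key=lambda h: h[2]):
        key = (ip, agent)
        previous = last_seen.get(key)
        if previous is not None and ts - previous[0] <= SESSION_GAP:
            session = previous[1]      # close enough in time: same visit
        else:
            session = []               # first hit, or gap too long: new visit
            sessions.append(session)
        session.append((ip, agent, ts))
        last_seen[key] = (ts, session)
    return sessions

# Hypothetical hits for illustration only.
hits = [
    ("203.0.113.7", "Mozilla/5.0", datetime(2006, 10, 10, 13, 0)),
    ("203.0.113.7", "Mozilla/5.0", datetime(2006, 10, 10, 13, 10)),
    ("203.0.113.7", "Mozilla/5.0", datetime(2006, 10, 10, 14, 0)),
    ("198.51.100.2", "Opera/9.0", datetime(2006, 10, 10, 13, 5)),
]
sessions = guess_sessions(hits)
```

The weakness is easy to see. Two people behind the same proxy with the same browser get merged into one visitor, and one person who comes back after lunch gets counted twice. Change the cutoff and the numbers change, which is a big part of why no two tools agree.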
The core issues boil down to data loss when the data is initially recorded, and poor data integrity. These two things are enough to drive a database programmer insane, or at least cause them to foam at the mouth with rage.
This situation isn't uncommon, and isn't limited to web logs. I mention web logs purely because it's a common example and the problems are both well known and well understood by many people.
I encountered an identical situation when talking to an executive at a software company. He talked about how they use Outlook shared folders to post notes about things so that they have a record and can look things up at a later date. Well, inside I was just seeing red. I'd seen the mess that the notes were, and that there was zero structure to the data. At best, if anyone wanted to go back and look over the records, they'd have to do a search for keywords, and then sift through trillions of irrelevant results. The data had zero integrity. But perhaps I'd better explain the purpose first.
The Outlook notes were there to provide a record, partly for when they had new employees or employee turnover and needed to get people up to speed on what had been done and where things were. This is pretty normal - you need your people to be informed.
However, a simple text record with no organization other than being in chronological order does little when you have many products, and overlap between products. How do you get only relevant results? The answer: you don't. The problem again boils down to data loss at the time of input, and zero data integrity (other than having a date attached to it).
I'm sure you can imagine similar situations where the wrong tools get used, resulting in data loss and poor data integrity. Once it's done, there is no real recovery. Any solution becomes a band-aid solution. It's like giving a shotgun to a sniper.
When it comes down to data that is important, it's best to measure how important it is and weigh how costly data loss would be.
For human generated data, a quick and cheap solution for the user is Microsoft Access. Don't laugh. It works. Better than a lot of IT people even know. Here's how...
While MS Access is a really poor choice for any database that has multiple users, it also has an option to create an ADP file, which is basically just a GUI to sit on top of another database like Microsoft SQL Server, which is a very capable data server. You just create the Microsoft Office Access Project (the ADP file), then layer it on top of MS SQL Server and configure your GUI to do what you need your people to do. It's simple, easy, and keeps your data organized in a way that you can easily search for relevant results.
Naturally you need to design your SQL Server database properly, but that's about as tough as it gets. Later on, if you're ambitious, you can create a custom client if you need more power when entering data.
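To give a feel for what "designed properly" might mean for the notes example, here is a minimal sketch in Python. It uses the built-in sqlite3 module purely as a stand-in so the example runs anywhere; it is not SQL Server, and the table names, column names, helper functions, and product names are all hypothetical. The point is only the shape: each note is linked to the products it concerns at the moment it is entered, so that relationship is never lost.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real server (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE notes (
    id      INTEGER PRIMARY KEY,
    created TEXT NOT NULL,   -- ISO date, so text order is date order
    author  TEXT NOT NULL,
    body    TEXT NOT NULL
);
-- One note can concern several products, and vice versa.
CREATE TABLE note_products (
    note_id    INTEGER NOT NULL REFERENCES notes(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    PRIMARY KEY (note_id, product_id)
);
""")

def add_note(conn, created, author, body, product_names):
    """Store a note and link it to every product it concerns."""
    cur = conn.execute(
        "INSERT INTO notes (created, author, body) VALUES (?, ?, ?)",
        (created, author, body),
    )
    note_id = cur.lastrowid
    for name in product_names:
        conn.execute("INSERT OR IGNORE INTO products (name) VALUES (?)", (name,))
        (product_id,) = conn.execute(
            "SELECT id FROM products WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO note_products (note_id, product_id) VALUES (?, ?)",
            (note_id, product_id),
        )

def notes_for_product(conn, name):
    """Return (created, author, body) for one product, oldest first."""
    return conn.execute(
        """SELECT n.created, n.author, n.body
           FROM notes n
           JOIN note_products np ON np.note_id = n.id
           JOIN products p ON p.id = np.product_id
           WHERE p.name = ?
           ORDER BY n.created""",
        (name,),
    ).fetchall()

# Hypothetical data for illustration only.
add_note(conn, "2006-03-01", "Ryan", "Installer rebuilt.", ["Widget Pro"])
add_note(conn, "2006-03-05", "Ryan", "Shared licensing fix.", ["Widget Pro", "Widget Lite"])
add_note(conn, "2006-03-09", "Ryan", "New icon set.", ["Widget Lite"])
pro_notes = notes_for_product(conn, "Widget Pro")
```

Asking for one product now returns only the notes that were tagged with it, including the ones that overlap with other products. No keyword search, no sifting.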
These kinds of things aren't really all that hard to do, but they do require thinking about the problem a bit more. The payoff is data that is reliable and easy to use.
Cheers,
Ryan