
The Renegade Blog


Written by: Renegade
Saturday, October 28, 2006 11:00 PM

If you've ever done any web log analysis with different web log analysis tools, then you already know that no two tools will give you the same result for the same data. They all process it differently and they all get different numbers. There are a few reasons why that is, and a lot of them relate to IP addresses, user agent strings, and incomplete or corrupted log lines.

Web log analysis results can vary by several hundred percent. This does not inspire confidence. Some tools perform better than others, of course, but the core problem boils down to the data being logged poorly.

When you open a web log file, you're presented with what most people see as incomprehensible garbage, or patterned garbage at best. But there is real information in there. Getting it out of the web log file in a way that preserves the reality of the situation, however, is much more difficult than it looks.
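To make the "patterned garbage" a bit more concrete, here is a minimal sketch of pulling fields out of one log line. It assumes the server writes the common Apache/NCSA "combined" format, which is my assumption and not something every server does; the `parse_line` name and the sample line are made up for illustration.

```python
import re

# Assumed layout: Apache/NCSA "combined" log format. A server configured
# differently needs a different pattern.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields, or None for an incomplete or corrupted line."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no body was sent; treat it as zero bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Made-up sample line using documentation-only addresses.
sample = (
    '203.0.113.7 - - [28/Oct/2006:23:00:01 +0000] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://example.com/" "Mozilla/5.0"'
)
record = parse_line(sample)
truncated = parse_line('203.0.113.7 - - [28/Oct/2006')
```

Here `record` holds the individual fields and `truncated` is `None`. What a tool does with lines like the truncated one (skip them, count them, or guess) is one place where the numbers start to diverge.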

A lot of that has to do with HTTP being stateless. But it's not an insurmountable problem. There are some good tools out there that make a very valiant effort and present some pretty good data. ClickTrax and Web Log Storming come to mind as two good tools. Analog is free, and while I'm not always a fan of free software, it does a darn good job all things considered.

Some numbers tend to be regularly underestimated by many tools, while others are overestimated. An example of overestimation is file download counts. This is a notoriously difficult problem to solve, and very few real solutions exist. Pay Per Download (PPD) businesses rely on that count being accurate, and their programs are among the few reliable solutions out there for counting file downloads. The best known PPD program is from CNET at Download.com (or Upload.com if you're an author).
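As a rough illustration of how download counts get inflated, the sketch below compares a naive hit count with a slightly more careful guess. It assumes already-parsed records with `ip`, `path`, and `status` keys (hypothetical names), and it assumes that much of the inflation comes from clients fetching one file in several pieces, which shows up as multiple 206 (Partial Content) lines. Neither number is "the" right answer.

```python
def count_downloads(records, path):
    """Return (naive, distinct_clients) counts for one file.

    records: iterable of dicts with "ip", "path", and "status" keys
    (hypothetical, already-parsed log entries).
    """
    hits = [r for r in records if r["path"] == path]
    # Naive count: every log line for the file is one download.
    naive = len(hits)
    # More careful guess: one download per client address that received
    # a full (200) or partial (206) response. Still a guess: the log does
    # not say whether the transfer finished, and many users can share
    # one address.
    distinct_clients = len(
        {r["ip"] for r in hits if r["status"] in (200, 206)}
    )
    return naive, distinct_clients


# Made-up records: one plain download, and one client fetching the same
# file in three ranged pieces.
records = [
    {"ip": "203.0.113.7", "path": "/files/app.zip", "status": 200},
    {"ip": "198.51.100.4", "path": "/files/app.zip", "status": 206},
    {"ip": "198.51.100.4", "path": "/files/app.zip", "status": 206},
    {"ip": "198.51.100.4", "path": "/files/app.zip", "status": 206},
    {"ip": "203.0.113.7", "path": "/index.html", "status": 200},
]
naive, distinct_clients = count_downloads(records, "/files/app.zip")
```

With this sample the naive count is double the per-client guess, and two tools making different choices here will happily report different totals for the same log.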

But I digress. The core problem comes from data fragmentation in log files and the inability to properly associate related lines after the fact. At best you can only guess at which lines are related to each other. Once the data is logged, you've lost the relationships and are at the mercy of guesswork to reassemble things.
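The guesswork looks something like the sketch below. It groups lines into "visits" by treating the same IP address and user agent within a 30 minute window as one visitor. The 30 minute cutoff, the record keys, and the `guess_sessions` name are all assumptions for illustration; real tools pick their own rules, which is exactly why they disagree.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed cutoff; tools differ on this


def guess_sessions(records):
    """Group records into sessions by (ip, agent) and a time gap.

    records: iterable of dicts with "ip", "agent", and "time" keys
    (hypothetical, already-parsed log entries).
    """
    sessions = []
    last_seen = {}  # (ip, agent) -> (time of last hit, session index)
    for rec in sorted(records, key=lambda r: r["time"]):
        key = (rec["ip"], rec["agent"])
        previous = last_seen.get(key)
        if previous is not None and rec["time"] - previous[0] <= SESSION_GAP:
            # Close enough in time: assume it is the same visit.
            index = previous[1]
            sessions[index].append(rec)
        else:
            # New client, or too long since the last hit: start a new visit.
            sessions.append([rec])
            index = len(sessions) - 1
        last_seen[key] = (rec["time"], index)
    return sessions


start = datetime(2006, 10, 28, 23, 0)
records = [
    {"ip": "203.0.113.7", "agent": "BrowserX", "time": start},
    {"ip": "203.0.113.7", "agent": "BrowserX",
     "time": start + timedelta(minutes=5)},
    # Same address, different browser: a second person behind the same
    # proxy, or the same person? The log cannot say.
    {"ip": "203.0.113.7", "agent": "BrowserY",
     "time": start + timedelta(minutes=6)},
    {"ip": "203.0.113.7", "agent": "BrowserX",
     "time": start + timedelta(hours=2)},
]
sessions = guess_sessions(records)
```

This sample comes out as three visits. Change the cutoff or drop the user agent from the key and you get a different answer from the same four lines.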

The core issues boil down to data loss when the data is initially recorded, and poor data integrity. These two things are enough to drive a database programmer insane, or at least cause them to foam at the mouth with rage.

This situation isn't uncommon, and isn't limited to web logs. I mention web logs purely because it's a common example and the problems are both well known and well understood by many people.

I encountered an identical situation when talking to an executive at a software company. He talked about how they use Outlook shared folders to post notes about things so that they have a record and can look things up at a later date. Well, inside I was just seeing red. I'd seen the mess that the notes were in, and that there was zero structure to the data. At best, if anyone wanted to go back and look over the records, they'd have to do a search for keywords, and then sift through trillions of irrelevant results. The data had zero integrity. But perhaps I'd better explain the purpose first.

The Outlook notes were there to provide a record partially for when they had new employees or employee turn over, and needed to get people up to speed on what had been done and where things were. This is pretty normal - you need your people to be informed.

However, a simple text record with no organization other than being in chronological order does little when you have many products, and overlap between products. How do you get only relevant results? The answer: you don't. The problem again boils down to data loss at the time of input, and zero data integrity (other than having a date attached to it).

I'm sure you can imagine similar situations where the wrong tools get used, resulting in data loss and poor data integrity. Once it's done, there is no real recovery. Any solution becomes a band-aid solution. It's like giving a shotgun to a sniper.

When it comes down to data that is important, it's best to measure how important it is and weigh how costly data loss would be.

For human generated data, a quick and cheap solution for the user is Microsoft Access. Don't laugh. It works. Better than a lot of IT people even know. Here's how...

While MS Access is a really poor choice for any database that has multiple users, it also has an option to create an ADP file, which is basically just a GUI to sit on top of another database like Microsoft SQL Server, which is a very capable data server. You just create the Microsoft Office Access Project (the ADP file), then layer it on top of MS SQL Server and configure your GUI to do what you need your people to do. It's simple, easy, and keeps your data organized in a way that you can easily search for relevant results.

Naturally you need to design your SQL Server database properly, but that's about as tough as it gets. Later on, if you're ambitious, you can create a custom client if you need more power when entering data.
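To give a feel for what "designed properly" might mean for the notes example, here is a minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is not SQL Server and not an ADP front end. The table names, columns, and sample rows are all hypothetical. The point is only that each note is tied to the products it concerns at the time it is entered.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real data server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE note (
    id         INTEGER PRIMARY KEY,
    author     TEXT NOT NULL,
    created_on TEXT NOT NULL,  -- ISO date, e.g. 2006-10-28
    body       TEXT NOT NULL
);
-- A note can concern several products, so the link gets its own table.
CREATE TABLE note_product (
    note_id    INTEGER NOT NULL REFERENCES note(id),
    product_id INTEGER NOT NULL REFERENCES product(id),
    PRIMARY KEY (note_id, product_id)
);
""")

# Made-up sample data.
conn.executemany(
    "INSERT INTO product (id, name) VALUES (?, ?)",
    [(1, "Widget"), (2, "Gadget")],
)
conn.executemany(
    "INSERT INTO note (id, author, created_on, body) VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2006-10-01", "Changed the installer for Widget."),
        (2, "bob", "2006-10-15", "Shared licensing code updated."),
        (3, "alice", "2006-10-20", "Gadget beta released."),
    ],
)
conn.executemany(
    "INSERT INTO note_product (note_id, product_id) VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2), (3, 2)],
)


def notes_for_product(conn, product_name):
    """Return (created_on, body) for every note linked to one product."""
    rows = conn.execute(
        """
        SELECT note.created_on, note.body
        FROM note
        JOIN note_product ON note_product.note_id = note.id
        JOIN product ON product.id = note_product.product_id
        WHERE product.name = ?
        ORDER BY note.created_on
        """,
        (product_name,),
    )
    return rows.fetchall()


widget_notes = notes_for_product(conn, "Widget")
```

Note that the shared licensing note comes back for Widget even though its text never mentions Widget. A keyword search over free-form notes would have missed it, because that relationship was never recorded.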

These kinds of things aren't really all that hard to do, but they do require thinking about the problem a bit more. The payoff is data that is reliable and easy to use.

Cheers,

Ryan

 


3 comments so far...

Re: Web Logs, Data, Data Loss, Data Integrity, and Getting it Right to Start

Ahh...'they require thinking about the problem a bit more'! And that is why so many software tools are used in a fashion they were never designed to. Some people see a particular software program, such as Outlook, and have a 'brilliant idea' as to how it can be used more efficiently. Thing of it is, the idea isnt new and when you really look at it, not that efficient, either.
Being a s/w developer and a SQL Server DBA myself, I can relate...

Have a good one!

By Anonymous on   Wednesday, November 01, 2006 7:55 AM

Re: Web Logs, Data, Data Loss, Data Integrity, and Getting it Right to Start

It's very painful to see people use the wrong tools. Especially when you're forced to follow the herd and you know better... Each step is a painful reminder that we do live in a 'Dilbert' world.

By Anonymous on   Wednesday, November 01, 2006 2:41 PM

Re: Web Logs, Data, Data Loss, Data Integrity, and Getting it Right to Start

With a decking skill of 25, i can use my slow program on the Black IC.

By on   Saturday, November 04, 2006 9:30 AM


Copyright 2010 by Renegade Minds