3 Different Approaches to Web Site Archiving – Which is best?

Information compliance regulations have grown steadily stricter over the past ten years or so. When it comes to compliance, most businesses focus on key areas such as printed marketing materials, emails, and press releases.

But not much thought is given to the company web site.

Archiving web site data is a special challenge for businesses. The information is often spread across multiple domains, IP addresses or web hosts, and much of it is dynamic and constantly changing.

For this reason, a new approach to web site archiving must be taken in order to ensure that you’re adequately protected in case of litigation.

Web Site Backup

This process simply involves making frequent backups of the company’s web sites, including any associated databases. This is the first, most obvious and (seemingly) simplest approach. However, it can cause problems in terms of both preservation and archive access.
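As a rough illustration of what one backup run could look like, here is a minimal Python sketch. The function name backup_site, the archive naming scheme and the idea that the database has already been exported to a dump file are all assumptions made for this example, not a prescribed procedure.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_site(site_dir, db_dump, dest_dir):
    """Bundle the site files and a database dump into one timestamped archive.

    Hypothetical helper: assumes the database was already exported to
    `db_dump` with the database's own export tool.
    """
    site_dir, db_dump, dest_dir = Path(site_dir), Path(db_dump), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in the file name keeps successive backups distinct.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_path = dest_dir / f"site-backup-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(site_dir, arcname="site")         # web root, added recursively
        archive.add(db_dump, arcname="database.sql")  # the exported database
    return archive_path
```

Running something like this on a schedule gives a series of dated archives, but note that each one still needs the original CMS and database software to be turned back into readable pages.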

In order for these archives to be accessible, a working copy of the web site’s CMS will need to be preserved for the entire lifetime of the archive. Otherwise, it will be difficult to retrieve the data from the database.
On the other hand, this approach is well-suited to preserving hidden areas of the web site which might be missed by other archiving methods.

Transaction Archives

This method essentially involves placing an archiving device between the web server and the web browser. Whenever an HTML page is delivered to the client, an identical copy is also saved to the archives.
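The idea can be sketched in software as a thin layer wrapped around the web application. The example below is a minimal sketch that assumes a Python WSGI application; the class name ArchivingMiddleware, the file naming and the index.tsv log are hypothetical choices made for illustration, and a commercial archiving device would work differently in its details.

```python
import time
from pathlib import Path


class ArchivingMiddleware:
    """WSGI middleware that saves a copy of every HTML response it delivers."""

    def __init__(self, app, archive_dir):
        self.app = app
        self.archive_dir = Path(archive_dir)
        self.archive_dir.mkdir(parents=True, exist_ok=True)
        self.counter = 0  # not thread-safe; good enough for a sketch

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["headers"] = headers  # remember the headers for later
            return start_response(status, headers, exc_info)

        # Buffer the whole body so the saved copy matches what is sent.
        chunks = self.app(environ, recording_start_response)
        try:
            body = b"".join(chunks)
        finally:
            if hasattr(chunks, "close"):
                chunks.close()

        content_type = next(
            (value for name, value in captured.get("headers", [])
             if name.lower() == "content-type"),
            "",
        )
        if content_type.startswith("text/html"):
            self.counter += 1
            name = f"{time.time_ns()}-{self.counter}.html"
            (self.archive_dir / name).write_bytes(body)
            # Record which URL path produced which archived file.
            with (self.archive_dir / "index.tsv").open("a") as index:
                index.write(f"{name}\t{environ.get('PATH_INFO', '')}\n")
        return [body]
```

Because every delivered page is written to disk, including pages behind a login, a sketch like this also makes the storage and privacy trade-offs of the method easy to see.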

In some cases, this method could be considered much more practical than the Web Site Backup approach since the data is much more readable, and the logistics of retrieval are much simpler. Also, this method preserves hidden areas of the web site which would not normally be found by a web crawler.

The downside is that this can also create significant storage requirements. (This is especially true for high-traffic, highly dynamic web sites.)

Another concern associated with this technique is the issue of privacy, since you’re technically tracking user behaviour. It’s one thing for Gmail to store your emails for you, but it would be another thing for Gmail to watch you while you read your messages.

Crawler-Based Archives

A third way of archiving web sites relies completely on a client-side approach. Client-side archives rely on a web crawler application that browses your site in a way that’s similar to what search engines do.

  • Start on the main page
  • Archive the HTML web page
  • Extract all of the internal links
  • Follow each link and repeat until the entire web site has been crawled and archived
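The steps above can be sketched as a small breadth-first crawler. This is a minimal sketch using only the Python standard library; the names LinkExtractor, fetch_html and crawl_site, the in-memory dictionary used as the "archive" and the max_pages limit are assumptions made for illustration. A production crawler would also need to respect robots.txt, throttle its requests and handle non-HTML content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page and decode it to text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl_site(start_url, fetch=fetch_html, max_pages=1000):
    """Return {url: html} for every internal page reachable from start_url."""
    site = urlparse(start_url).netloc
    archive = {}
    queue = deque([start_url])  # start on the main page
    seen = {start_url}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:  # urllib's network and HTTP errors derive from OSError
            continue
        archive[url] = html  # archive the HTML page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:  # extract the internal links
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)  # repeat for each new page
    return archive
```

The fetch parameter is there so the crawler can be tried against canned pages without touching a live site. Note that the crawler only ever sees pages that are linked from somewhere and reachable without logging in.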

This approach is good because it preserves practical HTML flat files. Also, the process can be performed by a neutral third-party company, which can be good for litigation.

But probably the biggest benefit of a client-side approach like this is that the archives are system-independent, and the process can easily be deployed across the many web properties which are managed by the company.

The drawback of this approach is that it’s not very effective at finding hidden content, or content that requires each user to log in with a username and password.

Also, the crawl must be run frequently on dynamic web sites. If someone posts a message and then quickly deletes it, a crawler-based application might not be able to catch it.