Q: Archiving a Website (Drupal Cloud and Managed Server)

If you have rebuilt a website on another service, or are retiring a website, and you request an archival copy of the old website, you will want to consider what archive formats are available and how this archive might need to be restored in the future.

Context

This article focuses on dynamic, CMS-driven websites (such as WordPress or Drupal), where the content includes:

  • Configurations managed on the server
  • The code shipped as part of the CMS
  • Files uploaded by users of the CMS into the docroot
  • (possibly) Custom themes and modules written for the website
  • A database (usually MySQL) containing most of the website's page contents and configurations

Answer

Database Dump and Docroot

A dump of the website's database, together with a tar or zip archive of the website's file space, contains all the data needed to resurrect the website somewhere else.
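As a rough illustration of what such an archive involves, the sketch below wraps the two steps in shell functions. It assumes a MySQL database, that mysqldump can find its credentials on its own (for example in ~/.my.cnf), and that you have shell access to the server. The function names, database name, and paths are hypothetical.

```shell
# Sketch only: function names, database name, and paths are assumptions.

# Dump one MySQL database to a gzip-compressed SQL file.
# usage: dump_database DB_NAME OUTPUT_SQL   (produces OUTPUT_SQL.gz)
dump_database() {
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking them; --routines includes stored procedures and functions.
    mysqldump --single-transaction --routines --result-file="$2" "$1" || return 1
    gzip -f "$2"
}

# Pack the website's file space into a compressed tarball.
# usage: archive_docroot DOCROOT_DIR OUTPUT_TARBALL
archive_docroot() {
    # -C switches to the parent directory first, so the archive stores
    # relative paths (docroot/index.php) rather than absolute ones.
    tar -czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# Example (hypothetical names):
#   dump_database example_site example_site.sql
#   archive_docroot /var/www/example_site/docroot example_site-docroot.tar.gz
```

Write the tarball somewhere outside the docroot, so the archive does not try to include itself.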

Pros:

  • This is the only format that gives you essentially the complete website that can be launched somewhere else.
  • This is the only format that will fully preserve structured data in a format that can be read programmatically.

Cons:

  • In order to read the contents in any human-readable form, you will need a LAMP server or hosting environment to run it in. Most people with the title "web developer" should be able to accomplish this.
  • While re-launching the website within a few months is possible, bringing back online a website that is years old is often fraught with security and versioning concerns, and becomes a much larger project. (Expect that within 2-5 years, restoring this dump will be difficult and expensive.)
  • Drupal Cloud in particular won't be able to re-launch a website from a dump you provide; for consistency of the service we insist that Drupal Cloud sites be updated in lock step and do not accept outside database edits.
  • Handle the dumps carefully: any user accounts within the CMS will have passwords that are hashed and saved in the database dump, and can potentially be cracked if the dump is exposed.
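One modest precaution for the last point is to restrict who can read the dump and to record a checksum so that later copies can be verified. The sketch below assumes a Linux system with GNU coreutils (sha256sum). The function name is hypothetical, and file permissions alone do not protect a dump that is copied to other media.

```shell
# Sketch only: assumes GNU coreutils; the function name is hypothetical.

# Restrict a dump to its owner and record a checksum beside it.
# usage: protect_archive FILE
protect_archive() {
    # Owner-only read/write, so other accounts on the machine cannot read it.
    chmod 600 "$1" || return 1
    # Store the SHA-256 digest in FILE.sha256 for later verification.
    sha256sum "$1" > "$1.sha256"
}

# Example (hypothetical file name):
#   protect_archive /secure/archive/example_site.sql.gz
#   sha256sum -c /secure/archive/example_site.sql.gz.sha256
```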

Server Backups

The normal process of retiring a Managed Server involves archiving the server's final TSM backup for one year post-decommission; additionally, the server's image will be retained in a bootable state for one month. Note that recovering a website from this backup is comparable in work to re-hosting the website with us, and includes most of the disadvantages of recovering from a docroot and database dump backup.

Use print to PDF or wget to crawl a website

If the website is only a few pages, using your browser's "Print to PDF" or "Save Page As" feature is a perfectly viable way to save reference copies. This will allow you to view the files locally and search for text within them.

If the website has many pages, you might be able to use wget to get a copy of the site. Search online for more information about wget. A reasonable invocation is:

wget --recursive --convert-links --page-requisites --no-parent "http://website.example.url/"

Since PDF and HTML files 20+ years old are still easily accessible to modern clients, you can expect that this kind of copy is reliably archivable.
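For a larger site, a slightly fuller crawl plus a quick sanity check might look like the sketch below. The function names and the destination directory are hypothetical. The one-second wait between requests is an assumption about what is polite to the server, not a requirement.

```shell
# Sketch only: function names and the destination directory are assumptions.

# Crawl a site into DEST_DIR as a browsable local copy.
# usage: crawl_site URL DEST_DIR
crawl_site() {
    # --mirror             recursive download with timestamping
    # --convert-links      rewrite links so the copy works offline
    # --page-requisites    also fetch the images, CSS, and scripts pages need
    # --adjust-extension   save pages with an .html suffix
    # --no-parent          stay below the starting URL
    # --wait=1             pause one second between requests
    wget --mirror --convert-links --page-requisites --adjust-extension \
         --no-parent --wait=1 --directory-prefix="$2" "$1"
}

# Count the HTML pages saved under a directory.
# usage: count_pages DIR
count_pages() {
    find "$1" -type f -name '*.html' | wc -l | tr -d ' '
}

# Example:
#   crawl_site "http://website.example.url/" ./site-copy
#   count_pages ./site-copy
```

Comparing the page count against what you expect from the live site is a cheap way to notice a crawl that stopped early. Pages reachable only through forms or JavaScript will generally not be captured.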

Use web.archive.org (third party service)

You can request a snapshot kept on web.archive.org (if the site hasn't already been crawled naturally). This is a third party service that we're not affiliated with.
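If you prefer the command line, the sketch below builds the two Wayback Machine URLs involved: one to look up an existing snapshot and one to request a new capture. The endpoints are shown as we understand them at the time of writing and may change or be rate limited, so check the Internet Archive's own documentation. The function names are hypothetical.

```shell
# Sketch only: endpoints may change; the function names are hypothetical.

# URL that requests a fresh capture of the given page.
# usage: wayback_save_url PAGE_URL
wayback_save_url() {
    printf 'https://web.archive.org/save/%s\n' "$1"
}

# URL that reports (as JSON) the closest existing snapshot, if any.
# The argument is not URL-encoded here, so pass a plain hostname or simple path.
# usage: wayback_lookup_url HOST_OR_URL
wayback_lookup_url() {
    printf 'https://archive.org/wayback/available?url=%s\n' "$1"
}

# Example:
#   curl -s "$(wayback_lookup_url website.example.url)"
#   curl -sS -o /dev/null -w '%{http_code}\n' \
#        "$(wayback_save_url http://website.example.url/)"
```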

Keep the website online as a -archive site

Depending on the situation, it's often reasonable to rename the website and leave it active while hiding it from search engines. If you also want the content to be inaccessible to visitors, use the website's maintenance mode. Either way, plan a proper sunset in 3-6 months.

This option is strongly recommended for website rebuilds where comparing new and old content side by side for a short time is desired.

IS&T Contributions

Documentation and information provided by IS&T staff members


Last Modified:

December 22, 2023
