Q: Archiving a Website (Drupal Cloud and Managed Server)
If you have rebuilt a website on another service, or are retiring a website, and are requesting an archival copy of the old site, you will want to consider which archive formats are available and how the archive might need to be restored in the future.
Context
This article focuses on active-content websites (such as WordPress or Drupal sites), where the content includes:
- Configurations managed on the server
- The code shipped as part of the CMS
- Files uploaded by users of the CMS into the docroot
- (possibly) Custom themes and modules written for the website
- A database (usually MySQL) containing most of the website's page contents and configurations
Answer
Database Dump and Docroot
A dump of the website's database, alongside a tar or zip archive of the website's file space, provides all the data needed to resurrect the website somewhere else.
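A minimal sketch of producing such an archive on the server (the database name, user, and docroot path below are hypothetical; adjust for your environment):

# Dump the site's MySQL database to a .sql file
mysqldump --single-transaction -u backup_user -p example_site_db > example_site_db.sql
# Archive the docroot, including uploaded files and any custom themes or modules
tar -czf example_site_docroot.tar.gz /var/www/example_site/docroot

The --single-transaction flag asks mysqldump for a consistent snapshot of InnoDB tables without locking the site while the dump runs.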
Pros:
- This is the only format that gives you essentially the complete website that can be launched somewhere else.
- This is the only format that will fully preserve structured data in a format that can be read programmatically.
Cons:
- To read the contents in any human-readable form, you will need a LAMP server or hosting environment to run it in. Most people with the title "web developer" should be able to accomplish this.
- While re-launching the website within a few months is possible, bringing back online a website that is years old is often fraught with security and versioning concerns, and becomes a much larger project. (Expect that within 2-5 years, restoring this dump will be difficult and expensive.)
- Drupal Cloud in particular won't be able to re-launch a website from a dump you provide; for consistency of the service we insist that Drupal Cloud sites be updated in lock step and do not accept outside database edits.
- Handle the dumps carefully: any user accounts within the CMS will have password hashes saved in the database dump, and these can potentially be cracked if the dump is exposed. If the dump will be shared, consider blanking the hashes first, as in the sketch below.
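A minimal sketch of blanking password hashes, assuming a Drupal 8+ schema (the database name is hypothetical, and table and column names vary by CMS and version; WordPress, for example, keeps hashes in wp_users.user_pass):

# Blank Drupal 8+ password hashes (uid 0 is the anonymous user)
mysql example_site_db -e "UPDATE users_field_data SET pass = '' WHERE uid > 0"

Run this against a copy of the database rather than the live site, or users will be locked out.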
Server Backups
The normal process of retiring a Managed Server involves archiving the server's final TSM backup for one year post-decommission; additionally, the server's image is retained in a bootable state for one month. Note that recovering a website from these backups is comparable in effort to re-hosting the website with us, and carries most of the disadvantages of recovering from a docroot and database dump backup.
Use print to PDF or wget to crawl a website
If the website is only a few pages, using your browser's Print to PDF or Save Page As features is a perfectly viable way to save reference copies. This will allow you to view the files locally and search for text within them.
If the website has many pages, you might be able to use wget to crawl a copy of the site. Search online for more information about wget. A reasonable invocation is:
wget --recursive --convert-links --page-requisites --no-parent "http://website.example.url/"
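Here, --recursive follows links within the site, --page-requisites also fetches the images, stylesheets, and scripts each page needs, --convert-links rewrites links so the copy can be browsed locally, and --no-parent keeps the crawl from wandering above the starting path.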
Since PDF and HTML files from 20+ years ago are still easily readable by modern clients, you can expect this kind of copy to remain reliably archivable.
Use web.archive.org (third party service)
You can request that a snapshot be kept on web.archive.org (if the site hasn't already been crawled naturally). This is a third-party service that we're not affiliated with.
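As a sketch, the Internet Archive's Save Page Now feature can trigger a capture of a publicly reachable page, either by visiting https://web.archive.org/save/ in a browser or from the command line (the URL below is the same placeholder used above):

# Ask the Internet Archive to capture a single page
curl -s "https://web.archive.org/save/http://website.example.url/"

Save Page Now captures pages individually, so check the resulting snapshots to confirm they cover what you need.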
Keep the website online as a -archive site
Depending on the situation, it's often reasonable to rename the website and leave it active while hiding it from search engines. Use the website's maintenance mode to make the content inaccessible (if that is desired), and plan a proper sunset in 3-6 months.
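A minimal sketch of both steps for a Drupal site with shell and drush access (commands vary by CMS and hosting environment; the docroot path is hypothetical):

# Tell search engines to stop indexing the archived site
printf 'User-agent: *\nDisallow: /\n' > /var/www/example_site/docroot/robots.txt
# Put a Drupal 8+ site into maintenance mode
drush state:set system.maintenance_mode 1 --input-format=integer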
This option is recommended for website rebuilds where comparing new and old content side by side for a short time is desired. Where the server or service hosting the old site may be decommissioned, we will cap the archived site's lifespan at six weeks.
Note: Drupal Cloud Service Sunset. The entire Drupal Cloud service is expected to be decommissioned in April 2025. This deadline limits the lifespan of an archived site on that service. We'll be removing lingering archive sites well ahead of the final deadline, to ensure that we can arrange a reprieve if content turns out to still be wanted.