How to archive a CMS powered website to static HTML

When switching the content management system of my website from ProcessWire to WordPress, I wanted to archive the previous website, because it contains some content that I want to keep and that should stay accessible.

In this post I’ll describe how to do it. Essentially, it is a single wget command with optional post-processing by sed.

For a real-world example, my goal is to create a collection of locally browsable HTML files from https://processwire2015.stut.de . I want to shut down that CMS but still publish the old website under a subdomain.

I’m doing this on the Linux command line (Ubuntu 22.04, but most Linux distributions, as well as macOS, have the tools mentioned). The go-to tool for ripping an entire website on Linux is wget. (wget is also available for Windows, see https://gnuwin32.sourceforge.net/packages/wget.htm and https://www.tomshardware.com/how-to/use-wget-download-files-command-line .)

I’m creating a blank subdirectory ~/processwire2015-static . All work is being done in this subdirectory.

1. Collect the Right wget Parameters

wget is a very versatile tool with lots of options. So I went through the documentation and collected the parameters for my use case:

  • follow links: -r
  • … but only within this domain: --domains=processwire2015.stut.de,www.stut.de,stut.de
  • write relative links, suitable for local viewing (as the original will go away): --convert-links
  • create files as .html, even if the URL ends with something else: --adjust-extension
  • also get page requisites like CSS: --page-requisites
  • don’t create subdirectories per host, to avoid the font being stored in a separate subdirectory: --no-host-directories

So the complete command line is:

wget -r --domains=processwire2015.stut.de,www.stut.de,stut.de --no-host-directories --convert-links --adjust-extension --page-requisites https://processwire2015.stut.de

This runs for a minute or so and generates a collection of files and subdirectories. The number of separate index.html files comes from the URL scheme of ProcessWire: for instance, the “english” page has the URL /english, which is not a good filename. So wget creates a subdirectory of that name and puts the real contents into an index.html file inside it.

The Linux tree command gives a nice overview:

.
├── css?family=Lusitana:400,700|Quattrocento:400,700.css
├── deutsch
│   ├── bueroservice-marion-stut
│   │   └── index.html
│   ├── dienstleistung
│   │   └── index.html
│   ├── erfahrungen
│   │   └── index.html
│   ├── index.html
│   └── lebenslauf
│       └── index.html
├── english
│   ├── cv
│   │   └── index.html
│   ├── experience
│   │   └── index.html
│   ├── index.html
│   └── services
│       └── index.html
├── index.html
├── kontakt-impressum
│   ├── datenschutzerklaerung
│   │   └── index.html
│   └── index.html
├── links
│   └── index.html
├── site
│   ├── assets
│   │   └── files
│   │       ├── 1025
│   │       │   └── passbild-martin-web.jpg
│   │       └── 1034
│   │           └── passbild-martin-web.jpg
│   └── templates
│       └── styles
│           └── main.css
└── site-map
    └── index.html

2. Postprocessing – Cleanup with sed

Inevitably, some manual cleanup will be needed.

Remove the Admin Login Page Link Target

All pages generated by the old CMS had an “Admin Login” link at the bottom, pointing to the admin page. As the CMS will go away, it is pointless to keep a link to a no-longer-existing admin page. So let’s point the link to the start page.

As all files need to be edited, this task cries out for automation. One of Linux’s go-to tools for mass replacement of text file content is sed. (awk is another tool capable of this, but sed is a lot simpler, and I wanted to learn it at this opportunity.)

Because I hadn’t used sed for quite a while, I searched the web for examples and found https://tecadmin.net/sed-command-in-linux-with-examples/ . For a single file, the command line is:

sed -i 's/adminpage//' index.html

The -i option writes the result in place to the file being edited, instead of printing it to standard output. The s/adminpage// command substitutes the regular expression adminpage (between the first two slashes) with the empty string (between the last two slashes). The single quotes just protect the command from being expanded by the shell.
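To see the substitution in isolation before unleashing it on the whole tree, here is a tiny throwaway experiment; the file name and the one-line HTML fragment are made up for illustration:

```shell
# Hypothetical one-line page fragment, just to watch sed work.
printf '<a href="/adminpage">Admin Login</a>\n' > /tmp/sed-demo.html

# Remove "adminpage" from the link target, leaving href="/" (the start page).
sed -i 's/adminpage//' /tmp/sed-demo.html

cat /tmp/sed-demo.html   # <a href="/">Admin Login</a>
```

Note that GNU sed (Linux) accepts -i on its own; the BSD sed shipped with macOS requires an explicit backup suffix, e.g. -i ''.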

To apply this command to all files, I’m using the find command:

find . -type f -exec sed -i 's/adminpage//' {} \;

The . means “start at this directory”; -type f means “only look at files”, as opposed to directories, symbolic links, devices, and so on; -exec means “execute the following command for each file found”; the {} in the command is a placeholder for the filename to be processed. The \; at the end is a plain semicolon signifying the end of the -exec command, with the backslash protecting the semicolon from the outer shell.
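A slightly safer variant restricts the edit to HTML files, so sed never touches binary assets such as the JPEG images; the + terminator also batches many filenames into a single sed invocation. Demonstrated here on a scratch directory with a made-up page:

```shell
# Scratch tree with one hypothetical page (paths invented for this demo).
mkdir -p /tmp/find-sed-demo/english
printf '<a href="/adminpage">Admin Login</a>\n' > /tmp/find-sed-demo/english/index.html

# Only *.html files are handed to sed; `+` instead of `\;` batches filenames.
find /tmp/find-sed-demo -type f -name '*.html' -exec sed -i 's/adminpage//' {} +
```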

Then of course remove the /adminpage directory.
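Given the URL scheme described above, wget presumably saved that page as a directory with an index.html inside (an assumption based on how the other pages were stored), so a recursive delete does it:

```shell
# Remove the downloaded admin page directory, if wget created one
# (the path is assumed from the URL scheme; adjust to what wget actually wrote).
rm -rf ./adminpage
```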

Remove the Admin Login Page Link Text

The same sed logic does the trick here, replacing “Admin Login” with nothing:

find . -type f -exec sed -i 's/Admin Login//' {} \;

Remove the Search Form Action

The old CMS’s template also placed a search field (a miniature form) on each page. This search form won’t work in a static copy. Because removing the whole form would have been too much effort, I decided to just remove its action: replace https://processwire2015.stut.de/search/ with nothing:

find . -type f -exec sed -i 's/https:\/\/processwire2015.stut.de\/search\///' {} \;

The sed command suffers from “leaning toothpick syndrome”: the literal forward slashes in the search text must be protected by preceding backslashes so they are not interpreted as end-of-regular-expression marks.
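sed accepts almost any character as the delimiter of the s command, so the same substitution can be written without any escaping by using | instead of /. Again demonstrated on a made-up throwaway file:

```shell
# Hypothetical search-form fragment to run the substitution against.
printf '<form action="https://processwire2015.stut.de/search/">\n' > /tmp/delim-demo.html

# With `|` as the delimiter, the slashes in the URL need no backslashes.
sed -i 's|https://processwire2015.stut.de/search/||' /tmp/delim-demo.html

cat /tmp/delim-demo.html   # <form action="">
```

Applied to the whole tree, the find command then becomes: find . -type f -exec sed -i 's|https://processwire2015.stut.de/search/||' {} \;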

Conclusion

You now have an adaptable recipe for archiving an entire CMS-driven website into a static collection of HTML files using the wget command, and for postprocessing it with the sed command.

This example task demonstrates some of the power of the command line. Explaining the same process for a graphical user interface would be much more complex.
