When switching the content management system of my website from ProcessWire to WordPress, I want to archive the previous website, because it contains some content that I want to keep and that should remain accessible.
In this post I’ll describe how to do it. Essentially, it is one single wget command with optional post-processing by sed.
For a real-world example, my goal is to create a collection of locally browsable HTML files from https://processwire2015.stut.de . I want to shut down that CMS but still publish the old website under a subdomain.
I’m doing this on a Linux command line (Ubuntu 22.04, but most Linux distributions, as well as macOS, have the tools mentioned). The go-to tool for ripping an entire website on Linux is wget. (wget is also available for Windows, see https://gnuwin32.sourceforge.net/packages/wget.htm and https://www.tomshardware.com/how-to/use-wget-download-files-command-line .)
I’m creating a blank subdirectory ~/processwire2015-static . All work is being done in this subdirectory.
1. Collect the Right wget Parameters
wget is a very versatile tool with lots of options. So I went through the documentation and collected the parameters for my use case:
- follow links: -r
- … but only within this domain: --domains=processwire2015.stut.de,www.stut.de,stut.de
- write relative links, suitable for local viewing (as the original will go away): --convert-links
- create files as .html, even if the URL ends with something else: --adjust-extension
- also get page requisites like CSS: --page-requisites
- don’t create subdirectories per host, to avoid the font being stored in a separate subdirectory: --no-host-directories
So the complete command line is:
wget -r --domains=processwire2015.stut.de,www.stut.de,stut.de --no-host-directories --convert-links --adjust-extension --page-requisites https://processwire2015.stut.de
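This was enough for my small site. If the target site is bigger or you want to be gentle on the server, wget can also pause between requests; an optional variant of the same command, adding a short delay (--wait and --random-wait are standard wget options, everything else is unchanged):
wget -r --wait=1 --random-wait --domains=processwire2015.stut.de,www.stut.de,stut.de --no-host-directories --convert-links --adjust-extension --page-requisites https://processwire2015.stut.de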
This runs for a minute or so and generates a collection of files and subdirectories. The number of separate index.html files comes from the URL composition of ProcessWire: for instance, the „english“ page has the URL /english, which is not a good filename. So wget creates a subdirectory of this name and an index.html file with the real contents.
The Linux tree command prints a nice overview:
.
├── css?family=Lusitana:400,700|Quattrocento:400,700.css
├── deutsch
│   ├── bueroservice-marion-stut
│   │   └── index.html
│   ├── dienstleistung
│   │   └── index.html
│   ├── erfahrungen
│   │   └── index.html
│   ├── index.html
│   └── lebenslauf
│       └── index.html
├── english
│   ├── cv
│   │   └── index.html
│   ├── experience
│   │   └── index.html
│   ├── index.html
│   └── services
│       └── index.html
├── index.html
├── kontakt-impressum
│   ├── datenschutzerklaerung
│   │   └── index.html
│   └── index.html
├── links
│   └── index.html
├── site
│   ├── assets
│   │   └── files
│   │       ├── 1025
│   │       │   └── passbild-martin-web.jpg
│   │       └── 1034
│   │           └── passbild-martin-web.jpg
│   └── templates
│       └── styles
│           └── main.css
└── site-map
    └── index.html
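To verify that the result really is locally browsable, it can help to view it through a small local web server instead of file:// URLs; a minimal sketch, assuming Python 3 is available:
python3 -m http.server 8000
Then open http://localhost:8000 in a browser and click through the pages.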
2. Postprocessing – Cleanup with sed
Inevitably, some manual cleanup will be needed.
Remove the Admin Login Page Link Target
All pages generated by the old CMS had a link „Admin Login“ at the bottom, pointing to the /adminpage page. As the CMS will go away, it is pointless to have a link to the no-longer-existing admin page. So let’s point the link to the start page.
As all files need to be edited, this task cries out for automation. One of Linux’s go-to tools for mass replacement of text file content is sed. (awk is another tool capable of this, but sed is a lot simpler, and I wanted to take this opportunity to learn it.)
Because I hadn’t used sed in quite a while, I searched the web for examples and found https://tecadmin.net/sed-command-in-linux-with-examples/ . For a single file the command line is:
sed -i 's/adminpage//' index.html
The -i option makes sed edit the file in place, as opposed to writing the result to standard output. The s/adminpage// command substitutes the regular expression adminpage with the empty string (the part between the last two slashes). The single quotes just protect the command from being expanded by the shell.
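To preview the effect without touching the file, you can omit -i and compare the output with the original; a small sketch for a single file:
sed 's/adminpage//' index.html | diff index.html -
An empty diff means nothing would change; otherwise diff shows exactly which lines would be edited.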
To apply this command to all files, I’m using the find command:
find . -type f -exec sed -i 's/adminpage//' {} \;
The . means „start at this directory“; -type f means „only look at files“ as opposed to directories, symbolic links, devices, …; -exec means „execute the following command for each file found“; the {} in the command is a placeholder into which the filename being processed is inserted. The \; at the end is a plain semicolon that marks the end of the -exec command, with the backslash protecting the semicolon from being interpreted by the shell.
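A possible refinement, not needed for a site this small: limit the substitution to HTML files and let find pass many filenames to a single sed invocation by ending -exec with + instead of \; :
find . -type f -name '*.html' -exec sed -i 's/adminpage//' {} +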
Then of course remove the /adminpage directory.
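Assuming wget placed that page in an adminpage subdirectory of the working directory (as it did for the other pages), that is a single command:
rm -r adminpage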
Remove the Admin Login Page Link Text
The same sed logic does the trick here, replacing „Admin Login“ with nothing:
find . -type f -exec sed -i 's/Admin Login//' {} \;
Remove the Search Form Action
The old CMS’s template also placed a search field (a miniature form) onto each page. This search form won’t work with a static copy. Because it would have been too much effort to remove the search form entirely, I decided to just remove the action: replace https://processwire2015.stut.de/search/ with nothing:
find . -type f -exec sed -i 's/https:\/\/processwire2015.stut.de\/search\///' {} \;
The sed command suffers from „leaning toothpick syndrome“, because the literal forward slashes in the search text need to be protected by preceding backslashes from being interpreted as „end of regular expression“ marks.
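If you want to avoid the toothpicks: sed accepts other delimiter characters for the s command, for example a pipe, so the slashes in the URL no longer need escaping:
find . -type f -exec sed -i 's|https://processwire2015.stut.de/search/||' {} \;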
Conclusion
You now have an adaptable recipe for archiving an entire CMS-driven website to a static collection of HTML files using the wget command, and postprocessing it using the sed command.
This example task demonstrates some of the power of the command line. Explaining the same process for a graphical user interface would be much more complex.