Static site hosting hurdles -- Volution Notes

Context

It seems that static sites are (for some time actually) back from the dead and here to stay for good (or until the next lap on the fashion treadmill).

For what a "static site" is I would usually point to Wikipedia, but this time, on both topics of static sites and generators, the content is so out-of-date that it's actually funny; see the static web page and static site generators articles. Thus, I'll have to resort to linking some marketing blogs for a more elaborate explanation: What is a static site generator? (from Cloudflare), and that's it! I would have loved to point to something from Netlify (but that blog is full of listicles) or from GitHub Pages (but that documentation already assumes one knows what one is doing)... Thus, all I can do is point to a search for what is a static website on DuckDuckGo... (If anyone has a good neutral introduction on the topic please send me a link.)

Anyway, in technical terms a static site is just a bunch of files on the file-system, served by a simple HTTP server, which when loaded in a browser present the user with something one can read (or listen to if accessibility is taken into consideration).
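To make "a bunch of files served by a simple HTTP server" concrete, here is a minimal sketch using Python's standard http.server module (the same one listed among the hosting solutions, and just as unsuitable for production). The function name make_static_server and the default port are my own choices for illustration.

```python
import functools
import http.server


def make_static_server(root, port=8080):
    """Build (but do not start) an HTTP server that serves files from `root`."""
    # SimpleHTTPRequestHandler maps GET paths onto files below `directory`.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(root)
    )
    # Bind to loopback only; http.server is a development tool, not a
    # hardened production server.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_static_server("public").serve_forever() (where "public" is a hypothetical output folder of a generator) is enough to browse the site locally; everything else in this article is about what this snippet does not do.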

Still related to static sites, there is also the concept of Jamstack which, although it has the same low-level technical underpinnings, serves a completely different purpose. While "static sites" usually mean sites meant for consumption as reading material (thus they consist mostly of HTML for content, CSS for styling, and a pinch of JavaScript for eye-candy), "jamstack sites" usually mean web applications meant for consumption through interaction (thus they consist mostly of JavaScript for logic, CSS for styling, and a pinch of HTML for bootstrapping everything). It must also be said that some sites meant for reading are actually served as web applications, if for no other reason, because there is never enough blinking-flying-dancing eye-candy in a site; oh, and ads!

As for examples, the very site you are reading is such a static site! (There are countless out there, just search on the internet.)

A small overlooked detail

In this article I don't want to focus on what a static site (or generator) is, but on an ancillary topic that is most often overlooked, yet is essential to the static site.

So far, if we ignore the topic of static site generators, all seems so simple it feels like we are back in the early 2000s. We have our HTML, CSS, and JS files somewhere on the file-system (hopefully in a version control system like Git) and all we need to do is serve them to our users.

Which leads me to the detail that I say is often overlooked:

How to serve a static site?

The short but surprising answer is that although we imagine and operate as if we actually were in the early 2000s, that's far from the real requirements in 2022...

A few hosting solutions

Depending on the era in which one started playing with computers, the answer might be any of the following:

the seasoned graybeard Linux/BSD user usually sticks with (in reverse order of popularity) OpenBSD httpd, lighttpd, Apache, Nginx, or any of the other well-established but lesser-known HTTP servers (see Debian packages or OpenBSD ports for alternatives);
alternatively, the same graybeard might also lean towards some simpler HTTP servers (because in the end one only needs to support GET from the file-system) and choose something like darkhttpd, webfs, one of the suckless.org promoted HTTP servers (e.g. mini_httpd, thttpd), and many other similar servers written in C;
(although not recommended for production) the Python developer could choose http.server or uWSGI (which is acceptable in production), the NodeJS developer could copy-paste from how to serve static files or use github.com/vercel/serve (which is also acceptable in production), and the Ruby (and other scripting language) developers could try to find similar snippets in their favorite ecosystem;
the Rust aficionado could choose github.com/svenstaro/miniserve or github.com/joseluisq/static-web-server; meanwhile the Go aficionado could choose the next item on the list, something like github.com/PierreZ/goStatic, or any of the myriad Go-based HTTP servers out there (for example, see the GitHub topics on web-server or http-server);
another newer alternative is the "do it all" (from compression to LetsEncrypt certificates) Caddy;
I, because I always like to be different, have my own Kawipiko Go-based blazingly fast static HTTP server; :)
the cloud operator, given that all of the above choices are self-hosted (or at least require some system administration skills), would immediately go for a hosted solution like CloudFlare Pages, Netlify, Vercel, or one of the source code hosting sites that also have static site hosting solutions like SourceHut pages or GitHub pages;
and, in addition to one of the above, for performance reasons one would certainly also use a CDN (Content Delivery Network) like CloudFlare; I would have linked other options here, but none I could find (whose name I recognize) are free;

Looking at the above list one surely starts to see the problem: sure, static sites seem simple enough, but when it comes to serving (and as said I'll ignore for the moment the generation) things quickly become complicated enough.

As an example, here are a few recent articles on how to serve a static site:

Self-hosting a static site with OpenBSD, httpd, and relayd;
Hosting my static sites with nginx;
Installing and Configuring Nginx on a Linux Home Web Server;
Self-hosting static websites (which touches upon multiple options);
Hosting a static site on Fly.io with Nix and Caddy (an entire quest involving Nix, an interesting approach especially for learning);

So, we (as an industry) have managed to streamline the authoring and compilation of static sites -- usually by employing a workflow that relies on Markdown and generators -- but we have yet to streamline a way to serve these sites, especially in the self-hosted scenario. Although in the cloud-hosted scenario things certainly look more streamlined -- because most of the magic happens inside their own infrastructure -- there sure are inconsistencies and issues; for example as discovered in this article about trailing slashes on URLs.

The deployment checklist

Thus, after authoring the content and generating the HTML, CSS, JS and other asset files, one needs to actually serve these to the users. And based on the previous hosting options, one has to think about (at least some of) the following tasks:

choose a self-hosted HTTP server (and where to run it) or a cloud-hosted provider; (in case of self-hosting, given that internet connectivity is an important aspect, most likely a VM hosted in a cloud provider is the best solution, such as Linode, Hetzner, or the innovative Fly.io;)
serve the site over https://:
- (all of the following are needed especially in the self-hosted scenario, while in the cloud-hosted one they are mostly solved by the provider itself;)
- check that the server actually supports TLS (most well-known servers do, but others don't), else use a TLS terminator or reverse-proxy (like HAProxy);
- check that http:// requests are redirected to https://;
- check that the TLS configuration is up-to-date with the latest industry recommendations (like those provided by Mozilla);
- issue and renew TLS certificates; (somewhat automated in well-known servers;)
- check that the TLS security related response headers are up-to-date with the latest industry recommendations; (see the next item for details;)
check that the content security related response headers are up-to-date with the latest industry recommendations (like those provided by OWASP); (some cloud-hosted solutions provide some of these, as do some CDN solutions; however, most of the self-hosted solutions require one to manage these;)
perhaps check that the server supports HTTP/2 (and HTTP/3), else use a CDN or reverse-proxy (like the previously mentioned HAProxy);
serve the content with proper Cache-Control directives;
serve the content compressed (with gzip or even brotli); (most cloud-hosted solutions do this, some self-hosted solutions have support for this, many simpler servers don't;)
serve the HTML, CSS, JS -- perhaps even SVG, JSON, XML and other plain text serialization formats -- minified; (most cloud-hosted solutions do this, very few self-hosted solutions even support this;) (luckily most generators already support this, else one could use github.com/tdewolff/minify as an extra step in the publication workflow);
bundle multiple CSS or JS files together (which helps with compression and reduces the request count); (most cloud-hosted solutions support this;)
check that with/without .html extension serving works properly; (i.e. /some-file.html should perhaps return the same content as /some-file, or even better, choose one as canonical and redirect for the other one;)
check that with/without slash redirects work properly; (i.e. /some-folder should redirect to /some-folder/, and /some-file/ should redirect to /some-file);
perhaps check that there aren't any dead links; (one could use github.com/raviqqe/muffet or github.com/lycheeverse/lychee for this;)
perhaps provide a /robots.txt;
perhaps provide a /sitemap.xml; (usually generated by the static site generator;)
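Part of this checklist can at least be verified mechanically. As a rough sketch, the function below takes the response headers of one resource and reports which of a set of expected headers are missing. The list in RECOMMENDED_HEADERS is only an illustrative selection of mine, not the OWASP or Mozilla recommendation itself, and the names are hypothetical.

```python
# Illustrative selection only; consult the OWASP and Mozilla guidance for
# the headers (and values) actually recommended at any given time.
RECOMMENDED_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "Referrer-Policy",
    "Cache-Control",
)


def missing_headers(headers, required=RECOMMENDED_HEADERS):
    """Return the required header names absent from `headers`.

    `headers` is a mapping of header name to value; header names are
    compared case-insensitively, as HTTP requires.
    """
    present = {name.lower() for name in headers}
    return [name for name in required if name.lower() not in present]
```

Such a check only says that a header exists, not that its value is sensible, so it catches regressions (a header silently dropped after a server or provider change) rather than proving the configuration is good.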

That's simple enough, right? :)

When simple doesn't hold

The point I'm trying to make is that although static sites seem a simple enough concept (and leaving aside that generating them, i.e. compiling from the actual sources, isn't), when it comes to hosting in 2022 it's as much of a hurdle as writing and running a proper web application server.

Moreover, many of these options can't be configured in one place -- some are configured in the HTTP server, some in the CDN, etc. -- and certainly can't be diff-ed, tracked or even easily tested.

As discovered in the trailing slash article, even moving from one cloud-hosted provider to another can break one's site if one expects URLs to look a certain way.
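The with/without slash and with/without .html rules from the checklist are small enough to pin down as a pure function, which is exactly what each provider decides differently behind one's back. The sketch below assumes (my choice, not a standard) that extension-less URLs are canonical for files and that folders always end with a slash; the function and parameter names are hypothetical.

```python
def canonical_redirect(path, files, folders):
    """Return the URL that `path` should redirect to, or None.

    `files` holds canonical extension-less file URLs (e.g. "/about"),
    `folders` holds folder URLs with a trailing slash (e.g. "/blog/").
    None means `path` is already canonical, or is not known at all.
    """
    if path in files or path in folders:
        return None
    # "/some-folder" -> "/some-folder/"
    if path + "/" in folders:
        return path + "/"
    # "/some-file/" -> "/some-file"
    if path.endswith("/") and path[:-1] in files:
        return path[:-1]
    # "/some-file.html" -> "/some-file"
    if path.endswith(".html") and path[: -len(".html")] in files:
        return path[: -len(".html")]
    return None
```

For example, with files = {"/about"} and folders = {"/", "/blog/"}, the requests /blog, /about/ and /about.html redirect to /blog/, /about and /about respectively, while /about itself and unknown paths yield None.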

Let's set all this aside for the moment and think about the workflow up to the moment of deployment. One can easily track (and diff) and snapshot the content source (given that most likely it's a bunch of Markdown files); one can easily do the same for the assets and generator configuration; one can even do the same for the resulting generated files (although diff-ing doesn't work so well due to minification). But once we are ready to deploy the site, one doesn't actually know what one will get without actually testing whether the responses one gets are the responses one expects (let alone being sure they'll be the same two years from now).

A better alternative

However, what has been described above doesn't have to be so unreliable. How about we add yet another step to the static site generation:

Generate the complete HTTP resource set!

No, I'm not mad, what I'm proposing is to:

generate a complete set of HTTP GET requests that are acceptable (actually a complete list of URLs for such resources); this set should include even redirects;
for each of these resources, generate the complete HTTP response (including the headers and body), perhaps with support for alternative Content-Encoding;
explicitly include for the response all the headers, from content type and encoding, download disposition, location for redirects, caching, up to the content security and TLS hardening headers;
perhaps add support for multiple "catch-all" responses, like a global 404 response, or deeper wildcard pages (see Netlify's redirect and rewrite shadowing for a possible solution);
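To show what such a resource set could look like, here is a deliberately naive sketch that walks a folder of generated files and produces a mapping from URL to a complete (status, headers, body) response, redirects included. It is an illustration under my own assumptions, not Kawipiko's actual archive format: the COMMON_HEADERS values are placeholders, and index.html is exposed only under its folder URL.

```python
import mimetypes
import pathlib

# Placeholder values; a real site would choose these deliberately.
COMMON_HEADERS = {
    "Cache-Control": "public, max-age=3600",
    "X-Content-Type-Options": "nosniff",
}


def build_resource_set(root):
    """Map every servable URL below `root` to (status, headers, body)."""
    root = pathlib.Path(root)
    resources = {}
    for file in sorted(root.rglob("*")):
        if not file.is_file():
            continue
        url = "/" + file.relative_to(root).as_posix()
        content_type = mimetypes.guess_type(file.name)[0] or "application/octet-stream"
        headers = {"Content-Type": content_type, **COMMON_HEADERS}
        response = (200, headers, file.read_bytes())
        if file.name == "index.html":
            # "/blog/index.html" is served as "/blog/" ...
            folder = url[: -len("index.html")]
            resources[folder] = response
            # ... and "/blog" explicitly redirects to "/blog/".
            if folder != "/":
                resources[folder[:-1]] = (301, {"Location": folder}, b"")
        else:
            resources[url] = response
    return resources
```

Any URL that is not a key of the returned mapping is, by construction, a 404; nothing is left for the server to guess.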

"Wasteful this is!" I hear you shout: we are wasting storage space for the same repetitive headers, wasting storage for multiple variants differing only in compression, wasting CPU to compress resources that might never be accessed.

Fortunately that's not the case. With a smart enough storage format all of these could be solved:

repetitive HTTP headers (either individually or as a set) could be stored only once and referenced from all the requests that need them; (one could also cleverly encode them, given that the headers and their frequently used values are already known;)
repetitive HTTP bodies could also be stored only once; (most sites don't have duplicated resources, however in some corner-cases there might be, like for example static dumps of dynamically generated sites;)
compression could be done only once when a new body is seen, caching compressions for repeated use; (not to mention one could turn the compression to 11;)
besides compression, one could also apply minification at this phase, and even image optimization;
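The de-duplication described above amounts to content-addressed storage. As a minimal in-memory sketch (again an illustration, not Kawipiko's on-disk format), the hypothetical ResourceArchive class below stores each distinct body and each distinct header set once, keyed by its SHA-256 fingerprint, and compresses a body only the first time it is seen. I use gzip from the standard library as a stand-in, so the dial only goes to 9, not 11.

```python
import gzip
import hashlib
import json


class ResourceArchive:
    """Store HTTP responses with de-duplicated header sets and bodies."""

    def __init__(self):
        self.bodies = {}       # fingerprint -> {"identity": bytes, "gzip": bytes}
        self.header_sets = {}  # fingerprint -> dict of headers
        self.index = {}        # url -> (status, headers fingerprint, body fingerprint)

    def _store_body(self, body):
        key = hashlib.sha256(body).hexdigest()
        if key not in self.bodies:
            # Compress once per distinct body, at the highest gzip level.
            self.bodies[key] = {
                "identity": body,
                "gzip": gzip.compress(body, compresslevel=9),
            }
        return key

    def _store_headers(self, headers):
        encoded = json.dumps(headers, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(encoded).hexdigest()
        self.header_sets.setdefault(key, dict(headers))
        return key

    def add(self, url, status, headers, body):
        self.index[url] = (status, self._store_headers(headers), self._store_body(body))

    def get(self, url, accept_gzip=False):
        status, headers_key, body_key = self.index[url]
        headers = dict(self.header_sets[headers_key])
        if accept_gzip:
            headers["Content-Encoding"] = "gzip"
            return status, headers, self.bodies[body_key]["gzip"]
        return status, headers, self.bodies[body_key]["identity"]
```

A real format would also skip the compressed variant when it is not smaller than the original (tiny or already compressed bodies), which this sketch does not bother to do.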

Thus, what we are left with is a compiled HTTP resource set that we should be able to track, diff (with the right tool), snapshot and test offline.
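As for "the right tool" for diff-ing, it doesn't have to be complicated. Assuming a resource set is represented as a mapping from URL to a (status, headers, body) tuple (my assumption for this sketch), comparing two builds of a site boils down to a few set operations; the function name is hypothetical.

```python
def diff_resource_sets(old, new):
    """Compare two mappings of URL -> (status, headers, body)."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # Same URL, but status, headers or body differ.
        "changed": sorted(
            url for url in old.keys() & new.keys() if old[url] != new[url]
        ),
    }
```

Run before each deployment, an unexpected entry under "removed" or "changed" (say, a redirect that vanished, or a caching header that changed on every page) is visible before the users see it.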

Once one is ready for deployment, the hosting check-list is reduced only to these few items:

choose the self-hosted server (and where to run it) or a cloud-hosted provider;
TLS aspect of https:// (i.e. certificates and TLS configuration); (meanwhile the security related headers are already built into the resource set, and given one uses the proper Content-Security-Policy header, even the redirects can be taken care of by the browser;)
that's it; enjoy the site!

In fact this isn't even an original idea as there already exists Kiwix, although it has a different use-case, that of bundling reading material for offline or disconnected use (including a snapshot of Wikipedia).

Putting it in practice

Remember when I said that I have my own way of serving static sites?

Well, I've just implemented some of the above in my own static site "archiver" and "server": Kawipiko. (That link also has the documentation about installation and usage, plus some implementation details and benchmarks.)

Want to see it in action?

see the demo site; it runs on desktop-grade (old) hardware, through a residential fiber connection, and although it's served through CloudFlare it's instructed not to apply caching at the edge;
this very site you are reading uses it; :)
(plus a few other production deployments;)

Not only is it secure (setting aside security issues that are found in the few dependencies it relies upon) but it's also extremely fast, at least on par with tuned Nginx and even better for some use-cases. How fast? ~70K to ~100K requests per second on my old laptop.

With regard to storage, the demo site is composed of ~350K resources, totaling ~4.5 GiB, which when compressed results in a single ~1.8 GiB archive. That archive contains besides the actual files also ~350K redirects (with/without slash). Regarding headers, there are ~30 unique headers for all files, plus one for each redirect.

Concluding remarks

I'm not saying that my own Kawipiko implementation is the correct or best approach.

However, I think we should try (or at least think about how) to standardize the hosting side of static sites, because in the end this is not the early 2000s. In that era index.html present in the URL was acceptable, and besides redirects a "webmaster" didn't need to know more about HTTP, or TLS, or security, or CDNs, or mobile experience, or Google's damn scoring system...

Also, what I'm proposing isn't even new: as mentioned earlier, Kiwix already employs a similar solution, and there even exists redbean which takes things to the extreme by embedding a zip file inside an executable that works on different operating systems (talk about portability).

(Followup)

A week after writing this article, I wrote a small followup detailing how one can leverage Linux's seccomp to strengthen the security of an HTTP server like Kawipiko.