How handling 404s well, after migrating from Magento 2 to Laravel, can save you money

You can either scale horizontally OR you can optimize your nginx config for handling missing files.


So what's the big issue here? Bots!

As always. Be it good bots or bad bots. For the sake of simplicity in this article, I'll put all crawlers, spiders, and bots in the same category: bots.

See, some bots have databases of websites with already indexed URLs, which they either recheck themselves or hand over to others to check.

That means: the longer your website exists, and the larger it is, the bigger those databases get, and the more traffic you'll get. Google, Bing, DuckDuckGo, and many other search engines belong to this category as well, and their databases are big.

So what happens when you migrate a big Magento 2 system with multiple subshops, multiple languages, and more than 40k unique product pages? Right, those URLs will be checked again and again. Even a year after the migration, the bots are still coming for the old data.

Hopefully you've created redirects for all the product URLs. We did. But in our case it was the static files that caused us trouble, because in reality every unique product page consists of a bundle of HTML + images + documents + videos. Thousands upon thousands of requests for old static files in old Magento 2 locations, which had been moved to new places with a new directory structure. The access logs showed that bots frequently request those images. And bad bots are often even more aggressive, requesting dozens of old files at the same time. Now add to the mix even older files which those nasty bots still remember but which didn't even exist in the old system.
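
As an aside: those product-URL redirects can also be kept away from the application at the nginx level. The following is only a sketch with made-up paths, not our actual setup; the mapping itself would come from your migration data:

# the map block lives in the http {} context, not inside server {}
map $uri $new_product_uri {
    default                          "";
    /old-shop/some-product.html      /products/some-product;
    # ... more entries generated from your migration mapping
}

server {
    ...
    if ($new_product_uri) {
        return 301 $new_product_uri;
    }
    ...
}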

And what's the problem? Wrong configuration!

See, the default Laravel nginx config and the default Apache .htaccess config both try to resolve every URL as a static file first. If that fails, they forward the request to the application itself. I've checked some other applications/frameworks/CMS systems, and many have the same issue.
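
For reference, the relevant part of the nginx config from the Laravel deployment docs looks roughly like this: anything that isn't an existing file or directory gets handed to index.php, including requests for long-gone images.

server {
    ...
    location / {
        # everything that is not an existing file or directory
        # is passed on to the Laravel front controller
        try_files $uri $uri/ /index.php?$query_string;
    }
    ...
}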

Now imagine having thousands upon thousands of requests for old static files per day, peaking at dozens of requests per second, which your application has to process. Now imagine those same requests also resulting in database calls and new sessions (since every request without cookies creates a new session, and bots usually don't send cookies, especially not the fresh ones they were just handed). You'll quickly run into database issues, memory issues, CPU load issues, until your server breaks down. Call it an unscheduled load test by unhired bots, "for free".

And what's the solution? It's actually quite simple: configure static file handling properly.

First, every access to the old static folders should result in an immediate 404 Not Found, without any further checking. In nginx, that can look like this:

server {
    ...
    # old Magento 2 asset directories: nothing lives there anymore,
    # so answer immediately without touching the filesystem or the app
    location ~* /(media|static|pub)/ {
        access_log off;
        log_not_found off;
        return 404;
    }
    ...
}

Second, every access to common static file types like images and documents should be checked against the filesystem without falling back to the application.

server {
    ...
    # known static file extensions: serve the file if it exists,
    # otherwise return 404 right away instead of booting the app
    location ~* ^.+\.(jpg|jpeg|gif|png|bmp|webp|ico|svg|tiff|css|js|txt|json|map|mp3|wma|rar|zip|flv|mp4|mpeg).*$ {
        access_log off;
        log_not_found off;
        try_files $uri =404;
    }
    ...
}

Third, access to files inside certain directories should likewise be checked against the filesystem without falling back to the application.

server {
    ...
    # directories that only ever contain static assets:
    # serve what exists, 404 what doesn't, never hit the app
    location ~* /(storage|css|js|vendor|images)/ {
        access_log off;
        log_not_found off;
        try_files $uri =404;
    }
    ...
}

Of course you have to fine-tune this. If, for example, you generate images on the fly and the URL already includes the file extension, those URLs have to be excluded from the rule above. But even then you can do it the smart way and exclude only those specific URLs; everything else should still not reach the application when the file doesn't exist.
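
A minimal sketch of such an exception, assuming the generated images live under a made-up /resize/ prefix. Since regex locations are matched in the order they appear in the config, this block has to come before the general extension rule from above:

server {
    ...
    # hypothetical on-the-fly image route: these URLs may point to files
    # that don't exist yet, so they must fall through to the application
    location ~* ^/resize/.+\.(jpg|jpeg|png|webp)$ {
        try_files $uri /index.php?$query_string;
    }
    ...
}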

And it could be helpful to have a nice 404 page for those rare occasions where someone really does request such a file in their browser. nginx has got you covered: simply put this in your config file, and you are good to go:

server {
    ...
    error_page 404 /404.html;
    ...
}

You can go all crazy with even more options for your error page; simply dive into the official nginx documentation on error_page.
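
One small addition, as a sketch (the /var/www/errors path is just an assumption): serve the error page from its own location and mark it internal, so nobody can request /404.html directly:

server {
    ...
    error_page 404 /404.html;
    location = /404.html {
        internal;               # only reachable via error_page
        root /var/www/errors;   # assumed path to the static 404 page
    }
    ...
}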

Make some tests

You would be surprised how many sites out there have this issue, whether they're based on Laravel or some other framework or CMS. Just pick 10 random sites, request some bogus URL like /this-is-just-a-test.png, look at how long the server takes to return the 404 page, check whether new cookies are set, and compare that with a request for a real static file, like /robots.txt. Heck, even laravel.com and spatie.be have this issue.

I've seen websites where I make a request to https://some-domain.com/this-is-a-bogus-test.png, get redirected to https://www.some-domain.com/this-is-a-bogus-test.png, then get another redirect to www.some-domain.com/de/this-is-a-bogus-test.png: three hits on the application, two redirects served from the application, and one final 404 page, also served from the application. We can all do better.

Is it really that bad? Well, you "could" save some money

It depends. If you're migrating a huge e-commerce site, it could be an issue. If you're just starting out with a new domain, it's probably no issue at all. If you're at the point of thinking about scaling horizontally due to traffic, then it could really be worth doing some analysis.

In this case it saved us a ton of money and infrastructure complexity. Before I dug deep into analyzing this stuff, our infrastructure engineers recommended a common quick solution to performance issues like this: scale horizontally.

But I'd rather go down the rabbit hole of performance optimization, for economic reasons and for the pure dopamine kick of seeing those metrics go down thanks to my own optimizations.