There’s a Lot You Can Learn From 404s
I recently made some changes to one of my WordPress sites that altered the URLs of some very old posts. Nothing broke on the site itself, but if someone had posted a link directly to one of those posts a long time ago, it could now lead to a 404 error. In fact, any reasonably old site could have broken links like this as things change over time.
So I thought about tracking the 404 errors with a simple script. Sure, I could have dug through the server’s logs to do this. But I thought it would be easier to add some code to the 404.php page on the site and keep a simple log of just the 404 errors, with just the information I needed.
Here’s the PHP code:
<?php
// Log each 404 as: timestamp, crawler tag, requested URL (tab-separated).
$url = $_SERVER["REQUEST_URI"];
$agent = " ";
// Some requests send no User-Agent header, so fall back to an empty string.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (strpos($ua, "Googlebot") !== FALSE) $agent = "go";
if (strpos($ua, "bingbot") !== FALSE) $agent = "bi";
if (strpos($ua, "DuckDuckBot") !== FALSE) $agent = "du";
if (strpos($ua, "YandexBot") !== FALSE) $agent = "yb";
if (strpos($ua, "Yahoo! Slurp") !== FALSE) $agent = "ya";
if (strpos($ua, "Baiduspider") !== FALSE) $agent = "ba";
if (strpos($ua, "Sogou") !== FALSE) $agent = "so";
// Append one line to the log file in the WordPress root directory.
if ($f = fopen(ABSPATH . "404log.txt", "a+")) {
    fwrite($f, date("ymd H:i:s") . "\t" . $agent . "\t" . $url . "\n");
    fclose($f);
}
?>
The bulk of the code looks at the user agent and tries to tag the biggest spiders out there. I wanted to know if any of these were still looking for old links. Having a simple “go” for Google and “ya” for Yahoo was a lot easier to read than the whole big user agent string.
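If the list of spiders grows, the chain of if statements could be swapped for a lookup table. This is only a sketch of that idea: the function name tag_agent is made up for illustration, and the substrings and tags are the same ones used in the code above.

```php
<?php
// Hypothetical helper (not part of the original 404.php):
// map a user agent string to a short crawler tag.
function tag_agent(string $ua): string
{
    // Substring to look for => tag to log. Same pairs as the if-chain.
    $bots = [
        'Googlebot'    => 'go',
        'bingbot'      => 'bi',
        'DuckDuckBot'  => 'du',
        'YandexBot'    => 'yb',
        'Yahoo! Slurp' => 'ya',
        'Baiduspider'  => 'ba',
        'Sogou'        => 'so',
    ];

    $tag = ' '; // a single space means "no known crawler matched"
    foreach ($bots as $needle => $code) {
        if (strpos($ua, $needle) !== false) {
            $tag = $code; // the last match wins, as in the if-chain
        }
    }
    return $tag;
}
```

With this in place, adding another crawler is a one-line change to the array, and the logging code would call tag_agent($ua) instead of running the individual checks.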
I looked at the log file after a few hours. I could see that some renamed URLs were going to be a problem, so I added rewrites for those in the .htaccess file. But I also uncovered some interesting things.
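Reading the raw log works fine for a few hours of data, but a quick tally makes the repeat offenders stand out. Here is a rough sketch of how that could look. The function name count_404_urls is hypothetical, and it assumes the tab-separated format written by the logging code above (timestamp, agent tag, URL).

```php
<?php
// Hypothetical helper: count how often each URL appears in the 404 log.
function count_404_urls(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        // Each entry is: timestamp <TAB> agent tag <TAB> URL
        $fields = explode("\t", rtrim($line, "\r\n"), 3);
        if (count($fields) < 3) {
            continue; // skip blank or malformed lines
        }
        $url = $fields[2];
        $counts[$url] = ($counts[$url] ?? 0) + 1;
    }
    arsort($counts); // most requested URLs first
    return $counts;
}

// Usage sketch. The file name matches the logger; the directory is an assumption.
// $lines = file(__DIR__ . '/404log.txt', FILE_IGNORE_NEW_LINES) ?: [];
// foreach (count_404_urls($lines) as $url => $n) {
//     echo $n, "\t", $url, "\n";
// }
```

The URLs at the top of that list are the ones most worth a rewrite rule, or at least a closer look.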
First, I noticed calls for an “apple-app-site-association” file, which I was missing. Turns out iOS requests this file to check which apps are associated with a site, which is what lets features like Universal Links open a link directly in an app. I have an app that accesses URLs on my server, but for some reason I never realized that this special file would make things work more smoothly. Here’s all the info you need about it: Support Universal Links.
Also, there were hits for apple-touch-icon, which you can read about here. Look at the section entitled “Look ma, no HTML!” to see how to make these without adding anything to your web pages. The important thing to note is that anyone can turn any web page into a bookmark that appears on the home screen of an iPhone or iPad. You don’t have to have any app, developer relationship with Apple or anything. And a home screen bookmark will look for and try to use a variety of apple-touch-icon files.
I also noticed lots and lots of hackers/bots trying to do things they shouldn’t. They are trying to directly access things inside of plug-ins that I don’t have. I can imagine that they are trying to gain access via exploits. I’m learning a lot about what they are trying to do by just looking at what they are trying to access.
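To separate these probes from honest broken links, one option is to check whether a logged URL points into a plug-in that isn't installed. The sketch below assumes the default WordPress layout, where plug-ins live under /wp-content/plugins/. Both function names and the example plug-in names are hypothetical.

```php
<?php
// Hypothetical helper: return the plug-in directory name a request
// path points into, or null if it is not a plug-in path.
function requested_plugin(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // malformed URL or no path component
    }
    // Assumes the default WordPress plug-in location.
    if (preg_match('#/wp-content/plugins/([^/]+)/#', $path, $m)) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: true when the URL targets a plug-in that is
// not in the list of installed plug-in directory names.
function is_plugin_probe(string $url, array $installed): bool
{
    $slug = requested_plugin($url);
    return $slug !== null && !in_array($slug, $installed, true);
}

// Usage sketch with made-up names:
// is_plugin_probe('/wp-content/plugins/some-plugin/readme.txt', ['akismet']);
```

This only catches one kind of probe, but it is enough to filter a lot of noise out of the log before looking for real broken links.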
On a happier note, I saw a few 404 errors for things in a “.well-known” directory. That led me to learn about Well-Known Uniform Resource Identifiers.
It was definitely worth a look to see which URLs were creating 404 errors. I’m continuing to gather them in a log to see what else I may find.