America In Transition: Crawling Gotcha: Rel=”Canonical” Creating 404’s or Weird Redirects for Google

Tuesday, October 4, 2016

The issue:

Site crawls tend to be the starting point of a lot of SEO reviews. One thing I’ve been running into more and more is Google Search Console flagging server response errors for pages that never existed in the first place, while other crawlers run against the same site don’t report those errors.

One of the ways Google finds these pages is not by crawling a site’s internal links, but by crawling the URL referenced in rel=”canonical”.

The rel=”canonical” URLs that point to pages that don’t exist typically have one of the following problems:

+ parameters (including trailing parameters) in the canonical URL
+ double trailing slashes
+ no trailing slash
+ incorrect protocol (http: vs. https:)

So Google is finding these errors, but they may be invisible to you.
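As a rough illustration of the patterns listed above, here is a minimal Python sketch that flags them in a single canonical URL. The function name `canonical_warnings`, its parameters, and the warning strings are my own placeholders, and it assumes your site’s preferred form is https with a trailing slash on directory-style URLs; adjust both if your site works differently.

```python
from urllib.parse import urlsplit


def canonical_warnings(canonical_url, expected_scheme="https", expect_trailing_slash=True):
    """Return a list of warnings for one rel="canonical" URL.

    Assumption: the preferred URL form is `expected_scheme` with a trailing
    slash on directory-style paths. Change the defaults to match your site.
    """
    parts = urlsplit(canonical_url)
    warnings = []

    # Parameters in the canonical, e.g. ?utm_source=...
    if parts.query:
        warnings.append("has parameters")

    # Doubled slashes anywhere in the path, e.g. /shoes//
    if "//" in parts.path:
        warnings.append("double slashes in path")

    # Missing trailing slash, checked only on paths whose last segment
    # has no file extension (so /page.html is not flagged).
    last_segment = parts.path.rsplit("/", 1)[-1]
    if expect_trailing_slash and last_segment and "." not in last_segment:
        warnings.append("no trailing slash")

    # http: where https: is expected, or the other way around.
    if parts.scheme != expected_scheme:
        warnings.append("unexpected protocol")

    return warnings
```

A flagged URL is not necessarily broken. This only narrows down which canonicals are worth checking against the server.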

How big of a deal is this:

In general, Google ignores incorrect implementations of rel=”canonical”. For smaller sites, this is not significant. For larger websites that rely on rel=”canonical” for crawl control, it becomes important, especially if you have a lot of duplicate content.

For a large site, fixing this could be a big win over duplicate content for a small project. Although purists may disagree with me, for a smaller site without a ton of duplicate content, this is not a big issue.

How to find this issue on your website:

Before you run the check

Since we’re trying to replicate what Google is seeing, there are a bunch of steps to surface this issue. I included the general steps below. It takes about 30 minutes to generate the report if you’re familiar with your crawl tool, and the method will work with any modern crawl tool. With that said, if you have specific questions about how to check for this issue with the crawler you’re using, feel free to drop me a note in the comments.

The steps to check whether you’re generating server errors through your rel=”canonical” tags

Step 1: Use your favorite crawler to crawl your site. Some crawlers include the contents of rel=”canonical” in a column; others let you extract it as a custom field.

Step 2: Grab an export, then copy the URLs referenced in rel=”canonical” to your clipboard.

Step 3: Run the contents of rel=”canonical” through your crawler

Step 4: Your crawler will now tell you which canonical URLs may be generating server errors for Google by sending it to canonical pages that never existed.

Step 5: Use VLOOKUP to map the rel=”canonical” errors back to the spreadsheet from your first crawl, generated in steps 1 and 2.

Step 6: Now you have a list of pages with canonical errors that need to be fixed.
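If you’d rather script the steps above than work in a spreadsheet, the flow looks roughly like the Python sketch below. It is only an illustration of the logic, not a replacement for your crawler: `extract_canonical`, `audit_canonicals`, and the `fetch_status` callback are names I made up, and the sketch assumes you already have each page’s HTML from your crawl and supply your own function that returns the HTTP status code for a URL.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> on a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def extract_canonical(html):
    """Return the rel="canonical" URL found in `html`, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def audit_canonicals(pages, fetch_status):
    """Find pages whose canonical URL does not answer with a 200.

    pages: dict of {page_url: html} from your first crawl (assumed input).
    fetch_status: callable(url) -> HTTP status code (you provide this).
    Returns sorted rows of (page_url, canonical_url, status).
    """
    # Steps 1-2: pull the canonical URL out of every crawled page.
    canonical_by_page = {url: extract_canonical(html) for url, html in pages.items()}

    # Step 3: check each distinct canonical URL once.
    targets = {c for c in canonical_by_page.values() if c}
    status_by_target = {target: fetch_status(target) for target in targets}

    # Steps 4-5: map the failing canonicals back to their source pages,
    # which is the job VLOOKUP does in the spreadsheet version.
    return sorted(
        (page, canonical, status_by_target[canonical])
        for page, canonical in canonical_by_page.items()
        if canonical and status_by_target[canonical] != 200
    )
```

For a dry run you can pass a plain dictionary lookup as `fetch_status`, for example `audit_canonicals(pages, {"https://example.com/b//": 404}.get)`, and swap in a real HTTP check once the output looks right. Note that the sketch treats anything other than 200 as a problem, so redirects show up in the list alongside 404s.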

The fix:

Update the rel=”canonical” reference to the correct location. If you’re able to, do this yourself in the CMS. If you’re managing a large website, though, your rel=”canonical” tags are more than likely created dynamically. In the cases where you can’t fix the canonical issues yourself and need web development support, you’ll have a spreadsheet mapping the errors to fix.

Happy SEO cleanup