17 Oct

Filed in Code, Django, SEO

The last few weeks have involved a lot of SEO/SEM work, both at iBegin and on my own sites, like Nibbits. On iBegin we’ve been tweaking our robots handling, meta descriptions, and titles, all to try to get the most out of Google. We’ve also had to do much more than that because of some quirks with Google. At Nibbits it’s some of the same work, but less specific; I use it more as a benchmark for how things are improving.

So, what have we done?

Change temporary redirects to permanent

One of the first headaches I’ve had recently was the discovery that Google doesn’t really understand, or care, how your redirects work. It wants you to have unique pages, not references to those pages. On iBegin, to avoid extra query overhead, we initially had a URL of the form /go/slice_ID/key_ID/ that forwarded to a business’s full URL. This issued a standard 302 redirect, which is the common status code for a temporary redirect.

To sum it up, Google complained that these pages were copies of the pages they redirect to (rather than references), so we ended up replacing all indexable redirects with permanent redirects. Wherever possible, we added more denormalization hooks to avoid these kinds of redirects altogether.

Some relevant code:

class ChangeRedirectsMiddleware(object):
    """Because Google sees a temporary redirect as a copy of the page,
    instead of a reference, we must force it to be a permanent redirect."""
    def process_response(self, request, response):
        # The header may be missing, and Googlebot capitalizes its name,
        # so default to '' and compare in lowercase.
        user_agent = request.META.get('HTTP_USER_AGENT', '').lower()
        if 'googlebot' in user_agent and response.status_code in (302, 307):
            response.status_code = 301
        return response
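For a site that isn't on Django, the same idea can be sketched as plain WSGI middleware. This is only a sketch, not what iBegin runs: the class name, the demo app, and the /go/ prefix default are all assumptions for illustration. It also rewrites the status for every client instead of sniffing the user agent, which sidesteps serving Google something different from normal users.

```python
class PermanentRedirectMiddleware:
    """WSGI wrapper that upgrades temporary redirects to permanent ones.

    Assumption: every redirect under `prefix` is stable enough that
    clients and crawlers may treat it as permanent.
    """

    def __init__(self, app, prefix="/go/"):
        self.app = app
        self.prefix = prefix  # hypothetical prefix, mirroring /go/ above

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            # The status line looks like "302 Found"; take the numeric part.
            code = status.split(" ", 1)[0]
            if path.startswith(self.prefix) and code in ("302", "307"):
                status = "301 Moved Permanently"
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)


def demo_app(environ, start_response):
    """Stand-in application that always answers with a temporary redirect."""
    start_response("302 Found", [("Location", "http://example.com/")])
    return [b""]


# Wrap the application once; any WSGI server can then serve `application`.
application = PermanentRedirectMiddleware(demo_app)
```

Limiting the rewrite to a URL prefix keeps genuinely temporary redirects elsewhere on the site (login flows, for example) untouched.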

Holy shit sitemaps work

One of the many things introduced over the last couple of years was Google sitemaps. I had never implemented these on a website until recently, simply because I never noticed an indexing problem. Let me tell you, waiting this long has been one of the worst decisions I have ever made. They’ve been great for keeping iBegin’s millions of businesses in Google’s index, as well as immediately adding almost all of Nibbits’ pages to Google.

For a rough example, Nibbits was up for about 3 months, and had about 3,000 pages in Google. I added sitemaps to it and within a week it jumped to 40,000 pages and growing.

Django provides a nice contrib application for working with sitemaps, but it’s not quite good enough for all use cases, like Yahoo crawling 200 sitemaps all at once that aren’t pre-generated. We ended up rolling out our own generate_sitemaps management command, which writes them to disk. At the moment we haven’t implemented any code to add or update changed entries.

Here’s a rough implementation (a bit old) of how ours was working. As you will see, I attempted to use generators to avoid the very scary amount of memory which was being used to generate these.

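from django.contrib.sites.models import Site
from django.core.paginator import EmptyPage, PageNotAnInteger
from django.http import Http404


class SitemapGenerator(object):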
    def __init__(self, site, page):
        self.site = site
        self.page = page
 
    def get_attr(self, site, name, obj, default=None):
        try:
            attr = getattr(site, name)
        except AttributeError:
            return default
        if callable(attr):
            return attr(obj)
        return attr
 
    def get_urls_from_site(self, site, page):
        current_site = Site.objects.get_current()
        page = site.paginator.validate_number(page)
        bottom = (page - 1) * site.paginator.per_page
        top = bottom + site.paginator.per_page
        if top + site.paginator.orphans >= site.paginator.count:
            top = site.paginator.count
        for item in site.paginator.object_list[bottom:top]:
            loc = "http://%s%s" % (current_site.domain, self.get_attr(site, 'location', item))
            url_info = {
                'location':   loc,
                'lastmod':    self.get_attr(site, 'lastmod', item, None),
                'changefreq': self.get_attr(site, 'changefreq', item, None),
                'priority':   self.get_attr(site, 'priority', item, None)
            }
            yield url_info
 
    def __iter__(self):
        yield "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        yield "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        try:
            site = self.site
            if callable(site):
                site = site()
            for url in self.get_urls_from_site(site, self.page):
                yield "<url><loc>%s</loc>" % (url['location'],)
                if url['lastmod']:
                    yield "<lastmod>%s</lastmod>" % (url['lastmod'].strftime('%Y-%m-%d'),)
                if url['changefreq']:
                    yield "<changefreq>%s</changefreq>" % (url['changefreq'],)
                if url['priority']:
                    yield "<priority>%s</priority>" % (url['priority'],)
                yield "</url>"
        except EmptyPage:
            # `page` is not a local name here; use the value stored on self.
            raise Http404("Page %s empty" % self.page)
        except PageNotAnInteger:
            raise Http404("No page '%s'" % self.page)
        yield "</urlset>"
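The "write them to disk" half isn't shown above, so here is a minimal, framework-free sketch of that step. It is not the actual generate_sitemaps command: the function names, file names, and base URL are assumptions. It takes any iterable of URLs (a generator works, so memory stays flat), splits it into files of at most 50,000 URLs (the per-file limit in the sitemaps.org protocol), and writes a sitemap index pointing at them.

```python
import os
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50000  # per-file limit set by the sitemaps.org protocol
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'
XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunked(iterable, size):
    """Yield lists of at most `size` items, holding one chunk in memory."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def write_sitemaps(urls, directory, base_url, per_file=MAX_URLS_PER_FILE):
    """Write sitemap-N.xml files plus sitemap-index.xml; return file names."""
    names = []
    for number, chunk in enumerate(chunked(urls, per_file), start=1):
        name = "sitemap-%d.xml" % number  # hypothetical naming scheme
        with open(os.path.join(directory, name), "w", encoding="utf-8") as fh:
            fh.write(XML_HEADER)
            fh.write('<urlset xmlns="%s">\n' % XMLNS)
            for loc in chunk:
                # Escape &, < and > so query strings don't break the XML.
                fh.write("<url><loc>%s</loc></url>\n" % escape(loc))
            fh.write("</urlset>\n")
        names.append(name)

    index_path = os.path.join(directory, "sitemap-index.xml")
    with open(index_path, "w", encoding="utf-8") as fh:
        fh.write(XML_HEADER)
        fh.write('<sitemapindex xmlns="%s">\n' % XMLNS)
        for name in names:
            fh.write("<sitemap><loc>%s</loc></sitemap>\n" % escape(base_url + name))
        fh.write("</sitemapindex>\n")
    return names
```

Calling write_sitemaps(url_generator, "/path/to/static/", "http://example.com/") from a cron job or management command means crawlers only ever hit static files, whatever they request.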

Unique <title>s

One of the many things Google has been whining about (via Webmaster Tools) is the fact that some of our pages share the same titles. I’m not yet sure how much this really affects weighting in Google, but it’s worth noting nonetheless.
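Duplicate titles are easy to check for yourself before Google reports them. A minimal sketch, assuming you can produce (url, title) pairs from your own database or a crawl; the function name and the case-insensitive comparison are assumptions, not anything Google documents:

```python
from collections import defaultdict


def find_duplicate_titles(pages):
    """Map each title used more than once to the URLs that share it.

    `pages` is any iterable of (url, title) pairs. Titles are compared
    case-insensitively with surrounding whitespace stripped.
    """
    by_title = defaultdict(list)
    for url, title in pages:
        by_title[title.strip().lower()].append(url)
    return {title: urls for title, urls in by_title.items() if len(urls) > 1}
```

Any key in the returned dict is a title worth making more specific, for example by adding the city or category to it.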

Robots and nofollow

Another one of the many things I had not been doing until earlier this year was using robots.txt and rel="nofollow" to prevent Google from distributing its weight to pages you don’t really care about. These could be anything from a contact form to an entire section of your website. This can really help you concentrate weight on the pages that matter to you in Google, but it should be used cautiously.
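Since a bad robots.txt rule can hide pages you do care about, it helps to test the rules before deploying them. Here is a small sketch using Python's standard urllib.robotparser; the rules and paths are made-up examples, not iBegin's actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep crawlers out of a contact form and a /go/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact/
Disallow: /go/
"""


def is_crawlable(path, user_agent="Googlebot", robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

Running is_crawlable() over a list of your most important URLs is a cheap safeguard against accidentally blocking them.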

  • limist

    Thanks for sharing this information, esp. your code for sitemaps; I too found that I had no choice but to generate static versions on my site with about 100,000 objects, because Yahoo and Google can easily suck up gobs of memory as they visit the sitemap URL. Not to mention the thrashing that can happen as the server swaps out; talk about an easy DoS attack – which can be partly addressed by using non-standard sitemap filenames.

  • Harro

    You have to be careful with adding Google-specific stuff to your app; Google doesn’t like to be fooled and could ban you for serving them something different than normal users.

    But then again.. they shouldn’t b*tch about redirects like that…

  • http://www.omtv.cc omtv

    Is it true that after adding a sitemap to Google, the overall rank of the site’s pages in the search results might drop compared to before?

  • http://www.davidcramer.net David Cramer

    I have noticed increased traffic on keywords which should have been weighted better before.

