Preventing development and staging sites from showing up in Google search results
Note: Since I first wrote this article, I have established some best practices. See the new version of this article.
At some point, most developers have found that, despite their best efforts, one of their staging or development sites has been found by Google or other search engines. Usually, this is because we must make the work available for preview to the client before the site goes live to the world. Without careful planning, it can be tricky to show a staging site to a client, but not to Google. Do not assume that, because you haven't submitted a URL to search engines, they won't find your site. Take proactive measures to keep your dev and staging sites out of search engines.
There are a number of ways to accomplish this. For the purposes of this article, I'm going to assume we're working in Drupal – although most of it applies to any environment.
The elements we'll be working with are:
- local DNS
- IP-based restrictions
- restrictions based on registered user or role
Naturally, the best way to attack this problem is to restrict search engine access to the site from the outset. Depending on how we restrict access, we'll need to be very careful not to push the restrictions forward to our production site. Neglecting that issue might mean that our production site also falls completely off the map. In that case, the cure might be worse than the disease.
One of the most effective ways to hide a development or staging site from the rest of the world is by creating the DNS entry for the site on our local workstation only. On a Mac, for example, we can add a line to the /etc/hosts file that contains the server's IP address followed by the hostname staging.example.com.
This line resolves staging.example.com to our server's IP address. Assuming we don't have an entry in the domain's public DNS, then the staging site will be seen only on the one workstation where we created this A-record.
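To make the mechanism concrete, here is a minimal Python sketch of how a hosts-style file maps a name to an address. The address 203.0.113.10 is a placeholder from the range reserved for documentation, not a real server, and parse_hosts is a hypothetical helper written for this illustration; it is not what the operating system's resolver actually runs.

```python
# Sample hosts-file content. 203.0.113.10 is a placeholder address
# (reserved for documentation); substitute the staging server's real IP.
SAMPLE_HOSTS = """\
127.0.0.1      localhost
# staging site, visible only from this workstation
203.0.113.10   staging.example.com
"""


def parse_hosts(text):
    """Return a dict mapping each hostname to the IP address on its line."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first address seen for a name (an assumption of this sketch).
            mapping.setdefault(name.lower(), ip)
    return mapping


hosts = parse_hosts(SAMPLE_HOSTS)
print(hosts.get("staging.example.com"))  # the placeholder IP
print(hosts.get("www.example.com"))      # None: not listed locally
```

A workstation without that entry has no way to turn staging.example.com into an address, which is exactly why the site stays out of sight.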
In my opinion, this is the best way to do it. Assuming our server is provisioned correctly and that we know what we're doing with DNS, it is the most comprehensive solution; neither the public nor search engines can see the site. As an added benefit, we don't have to change anything on our site... something we might forget to "undo" when we push our changes to the production server.
Unfortunately, though, this method can prove difficult for our clients to implement. Often, they don't have access to the /etc/hosts file (or the Windows equivalent). This works better for development sites than for staging sites.
Note: This method makes use of "security through obscurity", meaning that the pages are available, but only to those who know where to look. In this case, knowing "where to look" involves knowing that a particular hostname must resolve to a particular IP address. Unless we're dealing with sensitive data, I find this works great for our intended purpose. But by no means am I implying this is secure.
IP-based or user account restrictions
If we want to make the site available to a wider audience (say, our client's entire subnet), we can create IP-based restrictions within Drupal (or whatever CMS we might be using).
In a typical Drupal installation, you can set up access rules via /admin/user/rules. You need two host rules: one that denies the mask %.%.%.% and one that allows your office's network IP address (entered in place of xxx.xxx.xxx.xxx).
By denying access to all (%.%.%.%) and allowing only our office IP, we have shut everyone else out. We'll need to allow our client's IP address here, as well.
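The logic of that rule pair can be sketched in a few lines of Python. This is a simplified model based on the behavior described above (an allow rule wins over a matching deny rule), not Drupal's actual code, which evaluates the rules in the database and may differ in details. The address 198.51.100.7 is a placeholder standing in for the office IP, and both function names are hypothetical.

```python
import re

# (mask, allow) pairs. "%" is the wildcard used in the rules above.
# 198.51.100.7 is a placeholder for the office IP (documentation range).
RULES = [
    ("%.%.%.%", False),      # deny everyone
    ("198.51.100.7", True),  # allow the office
]


def mask_matches(mask, host):
    """Match host against a mask in which "%" stands for any run of characters."""
    pattern = "".join(".*" if ch == "%" else re.escape(ch) for ch in mask)
    return re.fullmatch(pattern, host) is not None


def is_denied(host, rules):
    """Denied when some deny rule matches and no allow rule matches."""
    hit_deny = any(mask_matches(mask, host) for mask, allow in rules if not allow)
    hit_allow = any(mask_matches(mask, host) for mask, allow in rules if allow)
    return hit_deny and not hit_allow


print(is_denied("198.51.100.7", RULES))  # False: the office gets in
print(is_denied("192.0.2.55", RULES))    # True: everyone else is shut out
```

Adding the client's address is then just one more allow entry in the list.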
It probably goes without saying, but be very careful when setting up these rules. If you get them wrong, you'll lock yourself out. (You can always go directly to the "access" table in the database to clear the restrictions, but it's best if you don't have to.)
If you're using drush to maintain your site, you might find that restrictive access rules get in your way. Drush will most likely give you an error something like: "Sorry, 127.0.0.1 has been banned." If so, you'll need to allow that IP address as well.
Once the rules are set up, test them from an address that should be allowed and from one that should not.
That's all we need to do. However, when we launch, we'll need to remove the access rules.
Another similar approach is to set up Drupal user accounts for everyone who needs to view the site. On most production sites, you probably already have user accounts (and roles) for your editors on the client end. In that case, just set your permissions very strictly.
By removing the permission to "access content" by anonymous users, we've made it so you can't see anything without having an account and being logged in.
Using either of these two methods, you'll need to be sure that you remove the restrictions when you push the revisions to the production server.
Restricting crawling via robots.txt
The last method I'll mention is to alter the site's robots.txt file. A typical robots.txt for a production site might be as simple as:
User-agent: *
Disallow:
This is the least restrictive. It says that all crawlers are allowed to crawl the entire site.
For our dev or staging site, we want to use the following:
User-agent: *
Disallow: /
This requests that the entire site not be crawled.
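If you want to confirm how a crawler would read each file, Python's standard urllib.robotparser module can evaluate the rules for you. The sketch below embeds the two common forms of the file (allow everything, disallow everything) as strings; the can_crawl helper, the Googlebot agent name, and the example URL are choices made for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Least restrictive: an empty Disallow permits everything.
PRODUCTION_ROBOTS = """\
User-agent: *
Disallow:
"""

# Most restrictive: ask every crawler to stay away from the whole site.
STAGING_ROBOTS = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "http://staging.example.com/node/1"
print(can_crawl(PRODUCTION_ROBOTS, url))  # True
print(can_crawl(STAGING_ROBOTS, url))     # False
```

Keep in mind that this only tells you what a well-behaved crawler is asked to do; nothing in robots.txt enforces it.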
Unlike the Drupal methods mentioned above, the robots.txt method requires that you change files, not the database. You'll need to make sure you set the robots.txt file back to normal production values before you take your site live.
Oops, it's too late... what do I do?
If you didn't take proper precautions prior to creating your dev or staging sites, there's a good chance that the search engines found your work-in-progress. What now? Well, let's be careful here. First, understand that search engines will cache your site for a certain length of time. Second, you'll need to keep in mind that restricting crawling of your site does not mean that existing indexed pages will disappear from search engine results.
If you find your staging site pages in search results, it's a good idea to go ahead and tell search engines not to index each page. The best way is to add a "noindex" meta tag to the head of every page. The noindex tag looks like this: <meta name="robots" content="noindex">
This can be tricky to do on large web sites, because a typical CMS-driven site might have several different types of content. You'll need to be sure that whatever tool you use to create meta tags is covering all the bases (list pages, pages, news, blog, etc.) This can be done pretty easily with the Nodewords module, but you'll probably need to play with it a bit to understand how it works; it's not the most intuitive interface.
You'll also need to be really careful that you remove all the noindex tags when you launch.
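Because it's easy to miss a content type (or to leave a tag behind at launch), a small checker can help. The following Python sketch uses the standard html.parser module to look for a robots meta tag containing noindex in a page's HTML. The class name, the has_noindex helper, and the two sample pages are hypothetical; in practice you would feed it the HTML of one page per content type.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html):
    """Return True if the page carries a robots meta tag with noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


STAGING_PAGE = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "<title>Staging</title></head><body></body></html>"
)
LIVE_PAGE = "<html><head><title>Live</title></head><body></body></html>"

print(has_noindex(STAGING_PAGE))  # True
print(has_noindex(LIVE_PAGE))     # False
```

On staging you want True for every page; after launch you want False everywhere.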
As the search engine re-crawls your site, it should see that it is disallowed from indexing every single page. Therefore, the entire site should eventually be gone. (Yeah, I said "eventually". Every interaction with search engines takes time. That's why we would like to avoid being in this situation in the first place.)
The difference between Disallow and Noindex
"So," you ask, "why go through all this trouble? Why not just change the robots.txt file and be done with it?" The problem is that the robots.txt file only tells search engines what they are (or are not) allowed to crawl; it doesn't tell them that the pages don't exist, nor does it restrict indexing. Therefore, a restrictive robots.txt will not remove pages from search results. Worse, a crawler that is barred from a page never gets to see that page's noindex tag. If the page is already in search results, or links to the pages in question exist elsewhere, you still have a problem.
In my experience, it's best to do these things in the right order:
- Add a "noindex" meta tag to all pages.
- Make sure the sitemap.xml is up to date.
- Once you've confirmed that the pages are no longer in search results, use one of the methods above to prevent crawling.
Also, it's a good idea to use Google's Webmaster Tools to request removal of the entire site.
Be sure to consider what impact your decisions might have on your site's SEO. For example, it takes a long time to get new domains to show up in search results. Therefore, if you're working on a site for a new domain, it's a good idea to get something open to public viewing as soon as possible. On such a site, I often create a splash page as soon as I get the basic Drupal installation running. From my standpoint, because the domain and site are unknown to search engines, it's not really important to have a nice design for the splash page... just a simple "coming soon" with the name of the site and most important keyphrase, in a simple core theme.
After I launch that page and submit it to search engines, I'll limit access to the other pages during design and CMS development. This is pretty easy within Drupal by using a module like Content Access, which gives you granular control over site access by content type. First, I create a new content type called "Front Page Temp", giving view access to anonymous users (which is the default). The only thing I'll use that content type for is the splash page during development.
Then, for all other content types, I lock down access so that anonymous users may not view any content.
When you're ready to launch, you'll need to switch the front page to point to the correct node, delete the old temporary front page, delete the temporary front page content type, and then open up the anonymous access settings for all content types.
Regardless of which route you choose, you'll want to conduct tests to make sure you've covered your bases. If you're restricting by local DNS or IP address, test from other networks to make sure the staging site is invisible. There are a number of remote rendering engines (typically used to render sites in multiple browsers) that are helpful. (http://netrenderer.com/ comes to mind.) If they can see your site, search engines probably can, as well. The same applies to restricting by user account; you can test using remote rendering sites.
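If you have shell access to a machine outside your office network, you can also script a rough version of this test. The Python sketch below uses only the standard library; fetch_status and visible_to_outsiders are hypothetical helpers, and treating "any successful response" as "visible" is a simplifying assumption. A page that answers with a login form and a 200 status would need a closer look.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, or timeout


def visible_to_outsiders(status):
    """Treat only a successful (2xx) response as visible."""
    return status is not None and 200 <= status < 300
```

Run from a network that should not have access, visible_to_outsiders(fetch_status("http://staging.example.com/")) should come back False: either the name doesn't resolve at all (local DNS method) or the server answers with an error status (IP or account restrictions).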
Additionally, it would be wise to use something like Google's Webmaster Tools. With it, you can test your site's crawler settings against individual pages.