Turk Hit Box

Ramblings of a Turkish Internet Mogul

Best Robots.txt For WordPress

As has been discussed many times before, robots.txt plays a very important role in search engine optimization. Search engine robots read your robots.txt before they crawl your site. WordPress, even though it is very search-engine friendly with its permalinks, seems to have problems with the latest search engine algorithms. The cause is what we call a “duplicate content filter”. When different URLs point to the same content, search engines like Google can treat the content as copied and may therefore penalize your whole blog!

For example, the following two URLs point to the same content in WordPress:

domain.com/category/category-name/post-name
domain.com/category-name/post-name

To keep the duplicate content filter from hitting your WordPress blog, I have come up with this robots.txt file, which should help your blog do better in search engines.

User-agent: Googlebot
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /archives/
Disallow: /sitemap.xml
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /category/

User-agent: Googlebot-Image
Disallow: /wp-includes/

User-agent: Mediapartners-Google*
Disallow:

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /

This file lets Google Images index everything except the files in the includes folder, lets the Google AdSense bot visit every page of your blog, and makes Googlebot ignore the unnecessary duplicate-content pages. It also blocks ia_archiver and duggmirror from the whole site. Adding this robots.txt file should increase your traffic by letting Google pay more attention to your important pages and discard the duplicate ones.
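
If you want to check which URLs the Googlebot rules above would block before you upload the file, here is a rough Python sketch. It assumes the wildcard behavior Google documents for robots.txt: * matches any run of characters, a trailing $ anchors the end of the URL, and everything else is a plain prefix match. The function names and sample paths are made up for illustration, and this is not Google's real matcher, so treat the results as a sanity check only.

```python
import re

# Disallow rules copied from the Googlebot group of the robots.txt above.
GOOGLEBOT_DISALLOW = [
    "/wp-content/", "/trackback/", "/wp-admin/", "/feed/", "/archives/",
    "/sitemap.xml", "/index.php", "/*?", "/*.php$", "/*.js$", "/*.inc$",
    "/*.css$", "*/feed/", "*/trackback/", "/page/", "/tag/", "/category/",
]


def pattern_to_regex(pattern):
    """Turn a Disallow pattern into a regex (assumed semantics, see above)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with ".*" for every "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def blocking_rule(path, rules=GOOGLEBOT_DISALLOW):
    """Return the first rule that matches the path, or None if it is allowed."""
    for rule in rules:
        # re.match anchors at the start of the path, like a prefix match.
        if pattern_to_regex(rule).match(path):
            return rule
    return None


if __name__ == "__main__":
    samples = [
        "/category-name/post-name",
        "/category/category-name/post-name",
        "/?p=12",
        "/wp-login.php",
        "/some-post/feed/",
    ]
    for path in samples:
        rule = blocking_rule(path)
        print(path, "->", "blocked by " + rule if rule else "allowed")
```

Because the group only contains Disallow lines, any matching rule means the URL is blocked, so the order of the rules does not change the verdict. Of the two duplicate URLs from the example earlier, only the /category/ one is blocked, which is the point of the file.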

Posted in: Wordpress SEO

13 Responses


Dan on 06.19.2007 at 3:22 am

Thanks for the robots.txt file, going to use this on my new blog. =)

Gary on 06.19.2007 at 9:07 pm

If you're disallowing sitemap.xml, doesn't that hurt the Google search process?  Why not just turn off the Google sitemap plugin?

Turk Hit Box on 06.19.2007 at 10:54 pm

sitemap.xml has no value as a search result page. It will not generate any traffic, and it may cause duplicate content for some pages. However, you shouldn't turn off the sitemap plugin, because Google Webmaster Tools uses the sitemap.

Wayne Price on 07.03.2007 at 11:53 pm

Where do you put the robots.txt file?

Bill Petro on 07.06.2007 at 12:10 am

What's the final verdict on allowing/dis-allowing sitemap.xml?

eylultoprak on 07.06.2007 at 6:37 pm

Could you translate this into Turkish? :) If it were in Turkish, I would consider publishing it on our site. I also didn't fully understand it.

Kamil Wojcicki on 12.19.2007 at 3:46 pm

Cheers for that, I have updated my robots file :D

Kamil Wojcicki on 12.20.2007 at 9:40 am

Could you explain the reasons for disabling the following?
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$

Turk Hit Box on 12.20.2007 at 2:12 pm

Sure,
1- http://www.turkhitbox.com/ and http://www.turkhitbox.com/index.php are the same page, and we don't want both of them listed.
2- It removes all URLs with a question mark.
3- It removes all URLs that end in .php.

Kamil Wojcicki on 12.20.2007 at 3:43 pm

Aha, so the web server doesn't redirect to the index file when asked for the root of a sub-domain, it just returns the contents of the index file. Cheers

Dan O'Neil on 01.08.2008 at 7:29 pm

Thanks for this useful info – just one quick question…

If your blog is in a subdirectory of the main site, e.g. http://www.example.com/blog, how do you change the lines which start with a */… rather than just /, e.g.
Disallow: */feed/

Thanks,

Dan O'Neil
http://www.aquariuscoaching.co.uk

TK on 10.17.2009 at 3:29 am

Why on earth would anyone disallow sitemap.xml? The whole point of a sitemap is for search engines to read it.
