As has been discussed many times before, robots.txt plays a very important role in search engine optimization. Search engine robots look at your robots.txt first, before crawling your site. WordPress, even though it is very search-engine friendly with its permalinks, seems to have problems with the latest search engine algorithms. The cause is what we call a “duplicate page filter”. When different URLs point to the same content, search engines like Google consider the content copied and can therefore penalize your whole blog!
For example, the following two URLs point to the same content in WordPress:
domain.com/category/category-name/post-name
domain.com/category-name/post-name
To avoid triggering the duplicate content filter on your WordPress blog, I have come up with this exclusive robots.txt file so that you can benefit from search engines.
User-agent: Googlebot
Disallow: /wp-content/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /feed/
Disallow: /archives/
Disallow: /sitemap.xml
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: */feed/
Disallow: */trackback/
Disallow: /page/
Disallow: /tag/
Disallow: /category/
User-agent: Googlebot-Image
Disallow: /wp-includes/
User-agent: Mediapartners-Google*
Disallow:
User-agent: ia_archiver
Disallow: /
User-agent: duggmirror
Disallow: /
This file lets Google Images index all your files except those in the includes folder, lets the Google AdSense bot visit every page of your blog, and makes Googlebot ignore the unnecessary duplicate-content pages. Adding this robots.txt file can increase your traffic by letting Google pay more attention to your important pages and discard the duplicate ones.
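To see at a glance what each robot is told, the file can be split into its User-agent groups. The Python sketch below is only an illustration: the ROBOTS_TXT string is a shortened excerpt of the file above, the names parse_groups and describe are made up for this example, and the parser assumes one User-agent line per group, which is the case here.

```python
# Shortened excerpt of the robots.txt above (assumption: only a few
# Googlebot rules are kept to keep the example short).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /wp-admin/
Disallow: /category/
Disallow: /*?

User-agent: Googlebot-Image
Disallow: /wp-includes/

User-agent: Mediapartners-Google*
Disallow:

User-agent: ia_archiver
Disallow: /

User-agent: duggmirror
Disallow: /
"""


def parse_groups(text):
    """Group Disallow values by the User-agent line they follow.

    Simplification: each User-agent line starts a new group.
    """
    groups = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = value
            groups.setdefault(current, [])
        elif field == "disallow" and current is not None:
            groups[current].append(value)
    return groups


def describe(paths):
    """Summarize a group's Disallow values in plain words."""
    if all(p == "" for p in paths):
        return "may crawl everything"  # an empty Disallow blocks nothing
    if "/" in paths:
        return "is blocked from the whole site"
    return "is blocked from: " + ", ".join(p for p in paths if p)


groups = parse_groups(ROBOTS_TXT)
for agent, paths in groups.items():
    print(agent, describe(paths))
```

The empty Disallow line under Mediapartners-Google* is what gives the AdSense bot access to every page, while the single "/" under ia_archiver and duggmirror shuts those two robots out completely.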


Thanks for the robots.txt file, going to use this on my new blog. =)
If you're disallowing sitemap.xml, doesn't that hurt the Google search process? Why not just turn off the Google sitemap plugin?
sitemap.xml has no search value on search engine result pages. It will not generate any traffic, and it may cause duplicate content for some pages. However, you shouldn't turn the sitemap off, because Google Webmaster Tools uses it.
What's the final verdict on allowing/disallowing sitemap.xml?
Could you translate this into Turkish? :) If it were in Turkish, I would consider publishing it on our site. I also didn't fully understand it.
Cheers for that, I have updated my robots file :D
Could you explain the reasons for disabling the following?
Disallow: /index.php
Disallow: /*?
Disallow: /*.php$
Sure,
1- http://www.turkhitbox.co/ and http://www.turkhitbox.com/index.php are the same page; we don't want both of them listed.
2- It blocks all URLs that contain a question mark.
3- It blocks all URLs that end in .php.
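The three rules in question can be tried out with a short Python sketch. It assumes Google-style matching, where a rule is compared against the start of the URL path, * stands for any run of characters, and a trailing $ anchors the rule to the end of the URL. The helper name rule_matches and the sample paths are made up for illustration.

```python
import re


def rule_matches(rule, path):
    """Return True if a robots.txt rule matches a URL path.

    Assumes Google-style wildcards: '*' is any run of characters and a
    trailing '$' means "the URL must end here".
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with "any characters".
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an unanchored rule is a prefix match.
    return re.match(regex, path) is not None


# Hypothetical sample paths, one check per rule.
print(rule_matches("/index.php", "/index.php"))      # True
print(rule_matches("/*?", "/?p=123"))                # True
print(rule_matches("/*.php$", "/wp-login.php"))      # True
print(rule_matches("/*.php$", "/page.php/extra"))    # False: .php is not at the end
```

Because of the trailing $, the third rule only catches URLs where .php is the very last thing in the address.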
Aha, so the web server doesn't redirect to the index file when asked for the root of a sub-domain, it just returns the contents of the index file. Cheers
Thanks for this useful info – just one quick question…
If your blog is in a subdirectory of the main site, e.g. http://www.example.com/blog, how do you change the lines which start with */… rather than just /, e.g.
Disallow: */feed/
Thanks,
Dan O'Neil
http://www.aquariuscoaching.co.uk
Why on earth would anyone disallow sitemap.xml? The whole point of a sitemap is for search engines to read it.
Thanks for the robots.txt :) very helpful indeed.
Just one question – I am curious as to why you would disallow the sitemap, though. It doesn't, or at least shouldn't, duplicate the content; instead it is a unique page that lists all your website's menu items, enabling the bots to crawl your website more efficiently.
I am interested in one of your banner spots but can you please send me more info – site stats – unique visits, location of clicks, bounce rates, no. of links leaving the homepage.
How about this one?
User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /tag
Disallow: /author
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/
User-agent: Mediapartners-Google
Allow: /
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Image
Allow: /
User-agent: Googlebot-Mobile
Allow: /
User-agent: ia_archiver-web.archive.org
Disallow: /
Why disallow /page/? Doesn't this mean that robots.txt blocks the pages on my website?