I use several crawling tools, and my crawl tests often came back saying they were only able to crawl one page. After much trial and error I discovered the cause was my robots.txt file.
This is how it looked:
User-agent: *
Allow: /
Disallow: /search/
Disallow: /tmp/
I attempted to validate my robots.txt file with this website http://tool.motoricerca.info/robots-checker.phtml.
I had some initial errors: the file contained unnecessary empty lines, and I was also using the "Allow" rule. The "Allow" field is not part of the original Robots Exclusion Standard, so the validator rejects it; the purpose of robots.txt is to state which content you want to block robots from crawling, using the "Disallow" rule.
After removing the Allow rule and the empty lines, my robots.txt file was almost valid. It looked like this:
User-agent: *
Disallow: /search/
Disallow: /tmp/
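As a sanity check that the cleaned-up rules do what I intended, Python's standard library ships a parser for exactly this format. The snippet below is a quick sketch (the example.com URLs are just placeholders) that feeds the rules above to urllib.robotparser and asks what a generic crawler may fetch:

```python
from urllib.robotparser import RobotFileParser

# The cleaned-up rules from the post, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /search/",
    "Disallow: /tmp/",
]

rp = RobotFileParser()
rp.parse(rules)

# /search/ and /tmp/ are blocked; everything else is allowed.
print(rp.can_fetch("*", "http://example.com/search/foo"))  # False
print(rp.can_fetch("*", "http://example.com/about"))       # True
```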
But the validating website was returning this error:
This line doesn’t follow a correct syntax. The correct syntax is: &lt;field&gt;:&lt;value&gt;, where “field” can be “User-agent” or “Disallow”. Please refer to Robots Exclusion Standard page for more informations.
It was frustrating because “how hard can a robots.txt file be!” It turned out that some strange characters were being read at the beginning of the file: “ï»¿”. After searching the net, I discovered this was down to my encoding: although my robots.txt file was in the correct encoding of UTF-8, for some reason it was set to “include unicode signature (BOM)”.
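Those three characters are not random: they are the three bytes of the UTF-8 BOM (EF BB BF) being re-read as Latin-1 text. A quick way to see this for yourself:

```python
# The BOM character U+FEFF encodes to three bytes in UTF-8...
bom = "\ufeff".encode("utf-8")
print(bom)  # b'\xef\xbb\xbf'

# ...and those same bytes, misread as Latin-1, produce the
# mystery characters from the validator error.
print(bom.decode("latin-1"))  # ï»¿
```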
After un-ticking this option in Dreamweaver, I re-validated the file and success!
“No errors found in this robots.txt file”
So, a few things to remember when building your robots.txt file:
- Save the file with UTF-8 encoding without BOM
- Do not use the rule “Allow”
- Do not have empty lines
- Correct format for a robots rule is <field>:<optionalspace><value><optionalspace>
- Use online tools to validate and check your robots file!
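If you want to check for a stray BOM without relying on your editor, a few lines of Python will do it. This is a minimal sketch (the file path is an assumption) that reads the raw bytes, detects the UTF-8 BOM signature, and strips it in place:

```python
def strip_bom(path="robots.txt"):
    """Remove a leading UTF-8 BOM (EF BB BF) from the file, if present."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(b"\xef\xbb\xbf"):
        with open(path, "wb") as f:
            f.write(data[3:])  # drop the three BOM bytes
        return True   # BOM was found and removed
    return False      # file was already clean
```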
FYI: a byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher-level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.