`urllib.robotparser` --- Parser for robots.txt

Source code: Lib/urllib/robotparser.py

This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the website that published the robots.txt file. For more details on the structure of robots.txt files, see RFC 9309.

class urllib.robotparser.RobotFileParser(url='')

This class provides methods to read, parse, and answer questions about the robots.txt file at url.

set_url(url): Sets the URL referring to a robots.txt file.

read(): Reads the robots.txt URL and feeds it to the parser.

parse(lines): Parses the lines argument.

can_fetch(useragent, url): Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.
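As a minimal sketch of how parse() and can_fetch() fit together, the following feeds robots.txt rules to the parser from a string instead of fetching them over the network. The rules, the "ExampleBot" user agent name, and the example.com URLs are assumptions made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, supplied locally so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes an iterable of lines, not a single string.
rp.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an illustrative user agent name.
allowed = rp.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = rp.can_fetch("ExampleBot", "https://example.com/private/data.html")
print(allowed, blocked)
```

Parsing from a local string like this can be convenient in tests, where the rules need to be known in advance.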

mtime(): Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.

modified(): Sets the time the robots.txt file was last fetched to the current time.
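One way a long-running spider might use mtime() is sketched below. The helper needs_refresh() and the one-hour MAX_AGE_SECONDS threshold are hypothetical names chosen for this example. The sketch also assumes behavior observed in CPython's implementation: mtime() returns 0 before anything has been parsed, the value is a time.time() timestamp, and parse() records the current time via modified().

```python
import time
import urllib.robotparser

# Hypothetical refresh interval; choose a value that suits the crawler.
MAX_AGE_SECONDS = 3600


def needs_refresh(parser, max_age=MAX_AGE_SECONDS):
    """Return True if the parser holds no data yet or its data is older than max_age."""
    last_fetched = parser.mtime()
    # Assumption: 0 means "never fetched", as in CPython's implementation.
    return last_fetched == 0 or time.time() - last_fetched > max_age


rp = urllib.robotparser.RobotFileParser()
stale_before = needs_refresh(rp)   # nothing has been parsed yet

# Local rules stand in for rp.read() so the example runs offline.
rp.parse(["User-agent: *", "Disallow:"])
stale_after = needs_refresh(rp)    # the timestamp was just updated

print(stale_before, stale_after)
```

In a real spider, the branch taken when needs_refresh() returns True would typically call read() again.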

crawl_delay(useragent): Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, or it does not apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

Added in version 3.6.

request_rate(useragent): Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter, or it does not apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

Added in version 3.6.
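Because both crawl_delay() and request_rate() can return None, callers generally need to check the result before using it. The sketch below illustrates this with rules parsed from a local list; the rules and the "ExampleBot" name are assumptions for illustration, and the printed wording reflects the conventional reading of these parameters rather than anything the parser enforces.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here so the example runs offline.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 6",
    "Request-rate: 3/20",
    "Disallow: /private/",
])

delay = rp.crawl_delay("ExampleBot")   # parsed Crawl-delay value, or None
rate = rp.request_rate("ExampleBot")   # RequestRate named tuple, or None

# Either value may be None, so guard before using it.
if delay is not None:
    print(f"wait {delay} seconds between requests")
if rate is not None:
    print(f"at most {rate.requests} requests per {rate.seconds} seconds")
```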

site_maps(): Returns the contents of the Sitemap parameter from robots.txt in the form of a list(). If there is no such parameter, or the robots.txt entry for this parameter has invalid syntax, returns None.

Added in version 3.8.
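Since site_maps() returns None rather than an empty list when no Sitemap entry is present, iterating over the result directly can fail. A minimal sketch of one way to handle both cases follows; the rules and the example.com sitemap URL are made up for illustration.

```python
import urllib.robotparser

# Hypothetical rules that include a Sitemap entry.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /tmp/",
    "Sitemap: https://example.com/sitemap.xml",
])
sitemaps = rp.site_maps()

# "or []" turns a None result into an empty list so the loop is always safe.
for sitemap_url in sitemaps or []:
    print(sitemap_url)

# Hypothetical rules without any Sitemap entry.
empty = urllib.robotparser.RobotFileParser()
empty.parse(["User-agent: *", "Disallow:"])
no_sitemaps = empty.site_maps()
print(no_sitemaps)
```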

The following example demonstrates basic use of the RobotFileParser class:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.pythontest.net/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.pythontest.net/")
True
>>> rp.can_fetch("*", "http://www.pythontest.net/no-robots-here/")
False
