Received: with ECARTIS (v1.0.0; list gopher); Wed, 12 Oct 2005 21:52:51 -0500 (CDT)
Received: from mo-69-69-114-6.sta.sprint-hsd.net ([69.69.114.6] helo=erwin.lan.complete.org)
	by glockenspiel.complete.org with esmtps (with TLS-1.0:RSA_AES_256_CBC_SHA:32)
	(TLS peer CN erwin.complete.org, certificate verified) (Exim 4.50)
	id 1EPtD1-0000QD-75; Wed, 12 Oct 2005 21:52:50 -0500
Received: from katherina.lan.complete.org ([10.200.0.4])
	by erwin.lan.complete.org with esmtps (with TLS-1.0:RSA_AES_256_CBC_SHA:32)
	(No TLS peer certificate) (Exim 4.50)
	id 1EPtCr-00035q-Jt; Wed, 12 Oct 2005 21:52:33 -0500
Received: from jgoerzen by katherina.lan.complete.org with local (Exim 4.54)
	id 1EPtCr-00076P-5W; Wed, 12 Oct 2005 21:52:33 -0500
Date: Wed, 12 Oct 2005 21:52:33 -0500
From: John Goerzen
To: gopher@complete.org
Subject: [gopher] Re: New Gopher Wayback Machine Bot
Message-ID: <20051013025233.GA26984@katherina.lan.complete.org>
References: <20051012180132.GA19083@complete.org> <200510122345.QAA17070@floodgap.com>
MIME-Version: 1.0
Content-type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <200510122345.QAA17070@floodgap.com>
User-Agent: Mutt/1.5.11
X-Spam-Status: No (score 0.1): AWL=0.008, FORGED_RCVD_HELO=0.05
X-Virus-Scanned: by Exiscan on glockenspiel.complete.org at Wed, 12 Oct 2005 21:52:50 -0500
Content-Transfer-Encoding: 8bit
X-archive-position: 1114
X-ecartis-version: Ecartis v1.0.0
Sender: gopher-bounce@complete.org
Errors-to: gopher-bounce@complete.org
X-original-sender: jgoerzen@complete.org
Precedence: bulk
Reply-to: gopher@complete.org
List-help:
List-unsubscribe:
List-software: Ecartis version 1.0.0
List-Id: Gopher
X-List-ID: Gopher
List-subscribe:
List-owner:
List-post:
List-archive:
X-list: gopher

On Wed, Oct 12, 2005 at 04:45:56PM -0700, Cameron Kaiser wrote:
> > Cameron, floodgap.com seems to have some sort of rate limiting and keeps
> > giving me a Connection refused error after a certain number of documents
> > have been spidered.
>
> I'm a little concerned about your project since I do host a number of large
> subparts which are actually proxied services, and I think even a gentle bot
> going methodically through them would not be pleasant for the other side
> (especially if you mean to regularly update your snapshot).

Valid concern.  I had actually already marked your site off-limits because
I noticed that.

Incidentally, your robots.txt doesn't seem to disallow anything -- might
want to take a look at that ;-)

[snip]

> I do support robots.txt, see
>
> gopher.floodgap.com/0/v2/help/indexer

Do you happen to have the source code for that available?  I've got some
questions for you that it could explain (or you could), such as:

1. Which would you use?  (Do you expect URLs to be HTTP-escaped?)

   Disallow: /Applications and Games
   Disallow: /Applications%20and%20Games

2. Do you assume that all Disallow patterns begin with a slash as they do
   in HTTP, even if the Gopher selector doesn't?

3. Do you have any special code to handle the UMN case where 1/foo, /foo,
   and foo all refer to the same document?

I will be adding robots.txt support to my bot and restarting it shortly
(a rough sketch of the normalization I have in mind follows after my
signature).

Thanks,

-- John

-- 
John Goerzen
Author, Foundations of Python Network Programming
http://www.amazon.com/exec/obidos/tg/detail/-/1590593715
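
P.S.  Here is a rough, untested sketch of the kind of selector normalization
I have in mind before matching against Disallow lines.  It's Python, the
function names and the item-type list are purely illustrative (not taken from
any existing indexer), and it simply percent-decodes, strips a UMN-style
leading item type, and forces a leading slash so that the three cases above
all compare equal:

from urllib.parse import unquote

# Common Gopher item-type characters (illustrative, not exhaustive).
GOPHER_TYPES = "0123456789+gIThis"

def normalize_selector(selector):
    """Return a canonical form of a selector for robots.txt matching."""
    sel = unquote(selector)          # treat %20 and a literal space alike
    # Strip a UMN-style leading item type, e.g. "1/foo" -> "/foo".
    if len(sel) >= 2 and sel[0] in GOPHER_TYPES and sel[1] == "/":
        sel = sel[1:]
    # Force a leading slash so "foo" and "/foo" compare equal.
    if not sel.startswith("/"):
        sel = "/" + sel
    return sel

def disallowed(selector, disallow_patterns):
    """True if the selector falls under any non-empty Disallow prefix."""
    sel = normalize_selector(selector)
    return any(sel.startswith(normalize_selector(p))
               for p in disallow_patterns if p.strip())

if __name__ == "__main__":
    patterns = ["/Applications and Games"]
    for s in ("1/Applications and Games/doom.txt",
              "/Applications%20and%20Games/doom.txt",
              "Applications and Games/doom.txt"):
        print(s, "->", disallowed(s, patterns))   # all three print True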