RSS

Python Web Crawler Class

07 Mar
Web Crawler

web crawler

import sys
import re
import urllib2
import urlparse
import datetime
import os

class WebCrawler:
tocrawl = set([])
crawled = set([])
linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?>')

def setBaseURL(self,url):
self.tocrawl = set([url])

def run(self):
url = raw_input('Enter an URL to crawl\n')
self.setBaseURL(url)
dir = raw_input('Where should I put crawled HTML source files?\n')
self.crawl(dir)

def getTitle(self,html):
startPos = html.find('<title>')
if startPos != -1:
endPos = html.find('</title>', startPos+7)
if endPos != -1:
title = html[startPos+7:endPos]
return title

def writeToFile(self,url):
with open('hyperlinks.txt', 'a') as file:
file.write(url + '\n')

def writeHTML(self,fileName,html,dirPath):
self.verifyDir(dirPath)
fileName = re.sub('[\\/:"*?<>|]',"",fileName)
with open(dirPath+'\\'+ fileName + '.txt', 'w+') as file:
file.write(html)

def verifyDir(self,path):
if not os.path.exists(path):
os.makedirs(path)

def crawl(self,dir_path):
while 1:
try:
if self.tocrawl:
crawling = self.tocrawl.pop()
print '\n\nStart Crawling - ' + crawling + '\n'
except KeyError:
raise StopIteration
url = urlparse.urlparse(crawling)
try:
response = urllib2.urlopen(crawling)
except:
continue
msg = response.read()
self.writeHTML(crawling,msg,dir_path)

#Display page title
print self.getTitle(msg)

links = self.linkregex.findall(msg)
self.crawled.add(crawling)
self.writeToFile(crawling)
for link in links:
if link.startswith('mailto'):
continue
if link.startswith('/'):
link = 'http://' + url[1] + link
elif link.startswith('#'):
link = 'http://' + url[1] + url[2] + link
elif not link.startswith('http'):
link = 'http://' + url[1] + '/' + link
if link not in self.crawled:
print '----' + link
self.tocrawl.add(link)

How to run this crawler?

Run following command in Python shell.

import web_crawler
w = web_crawler.WebCrawler()
w.run()
Advertisements
 
6 Comments

Posted by on March 7, 2013 in Python

 

Tags:

6 responses to “Python Web Crawler Class

  1. psn code gratuit septembre 2013

    August 29, 2013 at 10:25 pm

    I blog quite often and I truly appreciate your content.
    This great article has truly peaked my interest. I’m going to bookmark your website and keep checking for new details about once a week. I opted in for your RSS feed too.

     
  2. Time Timer

    September 18, 2013 at 10:04 am

    I really like your blog.. very nice colors &
    theme. Did you make this website yourself
    or did you hire someone to do it for you? Plz answer back as I’m looking to construct my own blog and
    would like to know where u got this from. thank you

     
    • Rajitha

      September 18, 2013 at 5:03 pm

      Hi,
      Thanks for commenting. If you really need to create a nice blog you can use ready to use templates. If you ‘re going to host your blog yourself in WordPress there ‘re no limits. You can select and upload a theme by yourself. But if you ‘re going to use a wordpress.com blog like this, you have few options. However in wordpress.com also there ‘re some types of pre-built templates. This blog uses one such template.
      Anyway, I’m afraid whether you clear your doubts.

       
  3. WiFi Password Hack 2013

    September 19, 2013 at 8:23 pm

    Hello, I think your blog could possibly be having internet browser compatibility issues.
    Whenever I take a look at your site in Safari, it looks fine but when opening in I.E., it’s got some overlapping issues.
    I just wanted to give you a quick heads up!
    Aside from that, excellent website!

     
    • Rajitha

      September 20, 2013 at 6:43 pm

      Hi,
      Thanks for commenting. According to your details, I checked https://codezone4.wordpress.com in Internet Explorer including some previous versions. But I could not find any compatibility issues. So I can not agree with this fact. I’m highly confident that the site works fine even in mobile devices. Can you please mention the IE version you tested it?

       
  4. vila Bran

    January 18, 2015 at 10:16 pm

    Thanks a bunch ffor sharing this with all of us youu
    actually recognise what yyou are speaking about! Bookmarked.
    Kindly also seek advice from my website =). We can have a link exchange agreement among us

     

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: