Scrapy crawler code template and general project

When I use scrapy to crawl things I'm interested in, I always find myself running startproject and genspider, and then repeatedly adding the same common helper code to the default templates that the scrapy library generates (sometimes, more embarrassingly, I haven't used scrapy for a while and have forgotten how some of it is written). That isn't really scrapy's fault.

After all, the projects and spider templates created through its commands are meant to be the simplest possible versions; expecting more from them is a bit picky, even for people who want to study or use scrapy in depth. Of course, some will say you can just copy and paste from projects you have already written. That works too, but the best option would be for the code and functions we use frequently to be added automatically whenever scrapy creates a project or spider from its commands.

To achieve this lazy goal, we need to modify, or even add to, scrapy's default project and spider templates. I will explain how in detail below and paste the templates I have put together. I hope they help readers, and even better, that readers can draw inferences from them and define the project and spider templates they want themselves.

Ideas for modifying the default templates

Projects and spiders are normally created with the scrapy startproject and scrapy genspider commands, so the scrapy command must already be on the PATH environment variable, able to accept the following arguments, and carry out the creation work. If we locate the scrapy library and open its commands folder, we find the files behind scrapy's terminal commands. Reading the code in startproject.py and genspider.py basically confirms that the templates used to create projects and spiders live in the templates folder inside the scrapy library. Opening it, we find two folders: project and spiders.
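
If you just want to find that folder quickly, a minimal sketch like the following prints its location (the exact path depends on your environment):

# Quick way to locate the default template folder of the installed scrapy
import os
import scrapy

templates_dir = os.path.join(scrapy.__path__[0], 'templates')
print(templates_dir)               # e.g. .../site-packages/scrapy/templates
print(os.listdir(templates_dir))   # should show the 'project' and 'spiders' folders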

project folder

Inside, everything is neatly organized. There is a scrapy.cfg file (the project configuration file), and inside the module folder are the items, settings, pipelines and other project files we are used to seeing, except that these files carry a .tmpl suffix.

In fact, when scrapy creates a project, the main work it does is the following:

  1. Copy the project template folder to the directory where scrapy startproject was run (or to the directory specified in the command)
  2. Rename the module folder to the specified project name
  3. Turn the .tmpl files in the module folder into .py files, i.e. remove the .tmpl suffix (substituting the project-name placeholders along the way)

Therefore, to modify the default project template, we only need to modify the corresponding files in this folder.
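
For illustration only, the three steps above roughly correspond to something like the following. This is not scrapy's actual implementation, just a simplified sketch of the mechanism:

# Simplified sketch of what "scrapy startproject" does (not the real implementation)
import os
import shutil
import string

def render_project(templates_dir, project_name, dest):
    # 1. copy the "project" template folder to the destination
    project_dir = os.path.join(dest, project_name)
    shutil.copytree(os.path.join(templates_dir, 'project'), project_dir)
    # 2. rename the inner "module" folder to the project name
    os.rename(os.path.join(project_dir, 'module'),
              os.path.join(project_dir, project_name))
    # 3. render every *.tmpl file and drop the .tmpl suffix
    for root, _, files in os.walk(project_dir):
        for name in files:
            if name.endswith('.tmpl'):
                path = os.path.join(root, name)
                with open(path, encoding='utf-8') as f:
                    content = string.Template(f.read()).substitute(
                        project_name=project_name,
                        # scrapy actually converts to CamelCase; capitalize() is a simplification
                        ProjectName=project_name.capitalize(),
                    )
                with open(path[:-len('.tmpl')], 'w', encoding='utf-8') as f:
                    f.write(content)
                os.remove(path)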

spiders folder

Inside it are the spider templates listed by the scrapy genspider -l command, such as basic, crawl, and so on. So we only need to modify these, or add our own spider template to this folder (essentially a Python file with a .tmpl extension), and from then on spiders can be generated from it directly with the scrapy genspider command.
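
For example, you can list the available templates and generate a spider from any of them (mytemplate is just an example template name):

scrapy genspider -l
scrapy genspider -t mytemplate <spidername> <domain>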

Specify the template path

Of course, modifying the templates inside the scrapy library directly has too wide an impact: if several people or several projects share one scrapy installation, it will directly affect the other projects or users. Fortunately, scrapy lets us specify a template path, which solves this problem.

If we look at the commands folder in the scrapy library, there is a startproject.py file. Open it and you can see the following code near the end (genspider.py contains similar code):

    @property
    def templates_dir(self):
        return join(
            self.settings['TEMPLATES_DIR'] or join(scrapy.__path__[0], 'templates'),
            'project'
        )

This property returns the template path. As you can see, it is taken from the TEMPLATES_DIR setting (which can be passed on the command line), falling back to the templates folder inside the scrapy library. So we can:

  • Place the templates you want to use (the directory must contain both the project and spiders sub-folders) in a path of your choice
  • Then, when running scrapy startproject or scrapy genspider in the terminal, pass one extra settings argument, as follows:
scrapy startproject -s TEMPLATES_DIR='your templates_path' <projectname> 
scrapy genspider -s TEMPLATES_DIR='your templates_path'  <spidername> <domain>
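
A convenient way to get started is to copy scrapy's own templates into that path and modify them from there. A minimal sketch (the target path is just an example):

# One-off helper to seed a custom template directory from scrapy's defaults
import os
import shutil
import scrapy

custom_dir = os.path.expanduser('~/my_templates')
default_dir = os.path.join(scrapy.__path__[0], 'templates')
shutil.copytree(default_dir, custom_dir)   # copies both the project and spiders templates
print(os.listdir(custom_dir))              # ['project', 'spiders']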

Optimize the default project template

I mainly optimized the settings and pipelines files in the scrapy project template.

  1. Settings file: the default settings file only contains some basic settings, the default UserAgent is not very friendly (you usually need to set a random UA yourself), and there are other commonly used settings to add.
  2. Pipelines file: in the pipeline file I usually need to listen for the spider_opened and spider_closed signals and then handle some housekeeping, such as closing files or disconnecting from the database when the spider closes, so the default pipelines template usually needs the same changes; they are unified here as well.
  3. Add an itemloaders.py.tmpl file, so that every project gets an itemloaders module by default, with the usual code structure and a usage reference written into it. This makes it easy to quickly build your own ItemLoader for the more complex or frequently repeated item-cleaning steps.

The optimized pipelines template is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from scrapy import signals
from scrapy.exceptions import DropItem


class ${ProjectName}Pipeline:
    # init the pipeline instance
    def __init__(self, crawler):
        # access settings here, e.g. self.arg = crawler.settings.get('ARG')
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # called once when the crawl starts
        p = cls(crawler)
        # register signals to specific pipeline methods
        crawler.signals.connect(p.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(p.spider_closed, signal=signals.spider_closed)
        return p

    def spider_opened(self, spider):
        # called when the spider is opened
        pass

    def process_item(self, item, spider):
        # called for each item yielded by the spider
        # must either return the item (possibly modified) or raise DropItem
        if item:  # replace with your own validation condition
            return item
        else:
            raise DropItem('invalid item')

    def spider_closed(self, spider, reason):
        # called when the spider is closed
        pass

A few notes:

  1. The from_crawler class method registers listeners for the spider_opened and spider_closed signals and binds them to the corresponding methods. Without this registration you cannot react to these two signals. You can of course register other signals as well; see the scrapy signals documentation (https://docs.scrapy.org/en/latest/topics/signals.html) for the full list.
  2. The commonly used DropItem structure is included, because pipelines generally need to filter out invalid items.
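
Also remember that a pipeline only runs when it is enabled in ITEM_PIPELINES; assuming a project named myproject (a placeholder name), that means something like:

# settings.py -- enable the pipeline (myproject is a placeholder project name)
ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,   # lower number = runs earlier
}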

You can copy this directly into your own pipelines.py.tmpl file and add frequently used functions and code snippets according to your needs, which can greatly improve coding efficiency.

The optimized settings template is as follows:

# random UA via the third-party faker library (pip install Faker)
from faker import Faker
ua = Faker()

BOT_NAME = '$project_name'

SPIDER_MODULES = ['$project_name.spiders']
NEWSPIDER_MODULE = '$project_name.spiders'

# auto close the spider after this many seconds (0 = disabled)
CLOSESPIDER_TIMEOUT = 0
# auto close the spider after scraping this many pages (0 = disabled)
CLOSESPIDER_PAGECOUNT = 0
# auto close the spider after scraping this many items (0 = disabled)
CLOSESPIDER_ITEMCOUNT = 0
# auto close the spider after this many errors (0 = disabled)
CLOSESPIDER_ERRORCOUNT = 0

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# note: this UA is generated once at startup, not per request
USER_AGENT = ua.user_agent()

  1. Mainly adds a random UA (generated via faker at startup; a per-request version is sketched below)
  2. Also adds the CLOSESPIDER_* settings for crawling only a given number of pages, items, etc., which are frequently used during debugging, so they no longer need to be copied and pasted by hand
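
Note that a UA set this way is generated once at startup and reused for every request. If you want a fresh UA per request, one option (not part of the template above, just a sketch built on the same faker call) is a small downloader middleware:

# Sketch of a per-request random UA downloader middleware (not part of the default template)
from faker import Faker


class RandomUserAgentMiddleware:
    def __init__(self):
        self.faker = Faker()

    def process_request(self, request, spider):
        # overwrite the UA header of every outgoing request
        request.headers['User-Agent'] = self.faker.user_agent()

# enable it in settings.py, e.g. (myproject is a placeholder):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400}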

Add itemloaders.py.tmpl template

In the project template folder, add the file mentioned above (itemloaders.py.tmpl) and paste the following code into it:

# Define here the itemloaders used to process the item
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/loaders.html
# demo code, for use inside a spider:
# from myproject.items import Product
# def parse(self, response):
#     l = ItemLoader(item=Product(), response=response)
#     l.add_xpath('name', '//div[@class="product_name"]')
#     l.add_xpath('name', '//div[@class="product_title"]')
#     l.add_xpath('price', '//p[@id="price"]')
#     l.add_css('stock', 'p#stock')
#     l.add_value('last_updated', 'today')  # you can also use literal values
#     return l.load_item()

from scrapy.loader import ItemLoader
from itemloaders.processors import Identity, TakeFirst, Join, MapCompose


class YourItemLoader(ItemLoader):
    # define the itemloader that processes the item
    # default processors, applied to every field without its own processor
    default_input_processor = Identity()
    default_output_processor = Identity()

    # input and output processors for a specific field
    # (fieldname and func are placeholders for your own field and cleaning function)
    fieldname_in = MapCompose(func, str)
    fieldname_out = TakeFirst()
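
As a concrete (hypothetical) example of filling in that structure, a loader that strips whitespace from a title field and keeps only the first price value could look like this:

# Hypothetical example built on the structure above
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose


class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()

    title_in = MapCompose(str.strip)            # clean every extracted value
    price_in = MapCompose(str.strip, float)     # strip, then convert to float
    price_out = TakeFirst()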

  1. The header comment documents how the ItemLoader is used, so that after a long break you don't need to look it up again
  2. It also adds the code structure for defining your own ItemLoader, to speed up coding
  3. Finally, don't forget to register the new file in TEMPLATES_TO_RENDER in the startproject.py file in scrapy's commands folder:
 
TEMPLATES_TO_RENDER = (
    ('scrapy.cfg',),
    ('${project_name}', 'settings.py.tmpl'),
    ('${project_name}', 'items.py.tmpl'),
    ('${project_name}', 'pipelines.py.tmpl'),
    ('${project_name}', 'middlewares.py.tmpl'),
    ('${project_name}', 'itemloaders.py.tmpl'),   # add your template file here
)

Optimize the default crawler template

I mostly use the basic and crawl spider templates, so I will only demonstrate how to modify and optimize these two.

basic

The optimized code is as follows:

import scrapy
from scrapy import signals


class $classname(scrapy.Spider):
    name = '$name'
    allowed_domains = ['$domain']
    start_urls = [
        'http://$domain/',
    ]

    # spider-level settings, highest priority
    custom_settings = dict(
        CLOSESPIDER_ITEMCOUNT=10,
    )

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # called when the crawl starts
        # register signals with spider methods (e.g. opened/closed) when you need
        # to do something on specific signals
        s = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        # called when the spider opens, optional
        pass

    def start_requests(self):
        # called automatically (only once) when the spider starts
        # optional: override it if you need to customise the initial requests
        # (e.g. set headers); otherwise delete it and scrapy will generate
        # requests from start_urls and call parse as the callback
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # must yield requests and/or items
        pass

    def spider_closed(self, spider, reason):
        # called when the spider closes, optional
        pass

  1. Added custom_settings to define spider-level settings, so some settings can be customised per spider
  2. Added the from_crawler class method and registered the spider_opened and spider_closed signals
  3. Added the start_requests method; delete it if you don't need it, otherwise the skeleton is already there so you don't have to write it from scratch

crawl

from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class $classname(CrawlSpider):
    name = '$name'
    allowed_domains = ['$domain']
    start_urls = ['http://$domain/']

    rules = (
        # extract urls from each response and hand them to the callback methods
        # :allow: str or tuple, regex(es) a url must match to be extracted
        # :deny: str or tuple, regex(es) that exclude urls from extraction
        # :restrict_xpaths: str or tuple, xpath(s) selecting the region(s) to extract urls from
        Rule(LinkExtractor(allow=r'Items/'), follow=True),
        Rule(LinkExtractor(allow=r'tags/'), callback='parse_item'),
    )

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # called when the crawl starts
        # register signals with spider methods (e.g. opened/closed) when you need
        # to do something on specific signals
        s = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        # called when the spider opens, optional
        pass

    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item

    def spider_closed(self, spider, reason):
        # called when the spider closes, optional
        pass

  1. Added the from_crawler class method and registered the spider_opened and spider_closed signals
  2. Added the commonly used Rule/LinkExtractor structure with its parameters documented, so you don't have to write it from scratch or look up the syntax again

Add custom spider template

You can create your own .tmpl file in the spiders template folder, write your own spider template in it, and then use it directly from the command line with scrapy genspider -t <templatename> <spidername> <domain>.
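
For example, a minimal custom template file, say spiders/myapi.tmpl (the name is arbitrary), only needs the same $classname/$name/$domain placeholders used by the built-in templates:

# spiders/myapi.tmpl -- a hypothetical custom spider template
# $classname, $name and $domain are substituted by scrapy genspider
import scrapy


class $classname(scrapy.Spider):
    name = '$name'
    allowed_domains = ['$domain']
    start_urls = ['http://$domain/']

    # defaults you always want in this kind of spider
    custom_settings = dict(
        DOWNLOAD_DELAY=1,
    )

    def parse(self, response):
        # fill in the parsing logic for this spider type
        pass

A spider is then generated from it with scrapy genspider -t myapi <spidername> <domain>.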

Conclusion

In fact, scrapy lets you customise even further: you can adapt the middlewares template, add your own command-line commands, or directly modify the source code of the scrapy library.

Also, if you find scrapy's Selector comfortable to use (I think it beats lxml and BeautifulSoup, because it integrates xpath, css and re and supports chained calls), you can import the Selector, LinkExtractor and so on into your own modules and use them directly.
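
For instance, the Selector can be used entirely outside a scrapy project on any HTML string; a small sketch (using requests only to fetch the page):

# Using scrapy's Selector as a standalone parsing library (sketch)
import requests
from scrapy.selector import Selector

html = requests.get('https://example.com').text
sel = Selector(text=html)

# chained css / xpath / re, just like inside a spider
title = sel.css('title::text').get()
links = sel.xpath('//a/@href').getall()
numbers = sel.css('p::text').re(r'\d+')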
