Scraping AJAX pages with scrapy-splash

[scrapy] 2024-05-16 圈点808

Summary: scraping AJAX pages with scrapy-splash

How to use scrapy-splash:


1. Install the scrapy-splash library with pip:

$ pip install scrapy-splash

scrapy-splash talks to Splash through the Splash HTTP API, so you need a running Splash instance. Splash is usually run in Docker, so Docker must be installed as well.


2. Install Docker, then start it once the installation finishes.

Pull the image:

$ docker pull scrapinghub/splash

Run scrapinghub/splash with Docker:

$ docker run -p 8050:8050 scrapinghub/splash
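
Once the container is running, the Splash HTTP API can be called directly, which is a quick way to confirm the instance works before wiring it into Scrapy. The following is a minimal sketch using only the standard library. It assumes Splash listens on localhost:8050 as in the command above; the helper names render_html_url and fetch_rendered_html are illustrative and not part of scrapy-splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: matches the -p 8050:8050 port mapping used above.
SPLASH_URL = "http://localhost:8050"


def render_html_url(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build a URL for Splash's render.html endpoint (illustrative helper)."""
    # 'url' is the page to render; 'wait' is how long Splash waits, in
    # seconds, after the page loads so that AJAX content can arrive.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5, timeout=30):
    """Return the browser-rendered HTML from a running Splash instance."""
    with urlopen(render_html_url(page_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_rendered_html("http://example.com") with the container up should return the rendered page source; a connection error usually means the container is not running or the port is not mapped.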

3. Configure the Splash service (everything below goes in settings.py):


1) Add the Splash server address:


SPLASH_URL = 'http://localhost:8050'  


2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:


DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


3) Enable SplashDeduplicateArgsMiddleware (the class that de-duplicates Splash arguments):


SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}


4)Set a custom DUPEFILTER_CLASS:


DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'


5) Set a custom cache storage backend (if you use Scrapy's HTTP cache, you must enable this Splash-aware storage):


HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
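
With five separate settings it is easy to miss one. As a rough sketch, a small check like the one below can report what is absent from a settings dictionary. The function missing_splash_settings and the REQUIRED_SPLASH_SETTINGS set are hypothetical helpers written for this post, not part of Scrapy or scrapy-splash. HTTPCACHE_STORAGE is left out of the required set because it only matters when the HTTP cache is in use.

```python
# Setting names taken from the configuration steps above.
REQUIRED_SPLASH_SETTINGS = {
    "SPLASH_URL",
    "DOWNLOADER_MIDDLEWARES",
    "SPIDER_MIDDLEWARES",
    "DUPEFILTER_CLASS",
}

# Downloader middlewares that the steps above add for Splash.
REQUIRED_DOWNLOADER_MIDDLEWARES = (
    "scrapy_splash.SplashCookiesMiddleware",
    "scrapy_splash.SplashMiddleware",
)


def missing_splash_settings(settings):
    """Return a list of missing setting names and middleware paths.

    'settings' is a plain dict, for example built from a settings module
    (hypothetical helper, not a Scrapy API).
    """
    problems = sorted(REQUIRED_SPLASH_SETTINGS - set(settings))
    downloader = settings.get("DOWNLOADER_MIDDLEWARES", {})
    for path in REQUIRED_DOWNLOADER_MIDDLEWARES:
        if path not in downloader:
            problems.append(path)
    return problems
```

An empty list means every setting from steps 1) to 4) is present; anything else names what still needs to be added.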



4. Example:


Fetching HTML content:

"""
from scrapy import Request
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from scrapy_splash import SplashMiddleware
from scrapy.http import Request, HtmlResponse
from scrapy.selector import Selector
"""



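import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "myspider"  # Scrapy requires a spider name; this one is illustrative
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        # Send every start URL through Splash; 'wait' is the time in seconds
        # Splash waits after the page loads, giving AJAX content time to arrive.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is the result of a render.html call; it
        # contains HTML processed by a browser.
        # Illustrative extraction: yield the title of the rendered page.
        yield {'url': response.url, 'title': response.css('title::text').get()}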


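If DEFAULT_REQUEST_HEADERS is enabled in settings.py, be sure to comment it out; see https://github.com/scrapy-plugins/scrapy-splash/issues/67 for details. The Host value in DEFAULT_REQUEST_HEADERS will not match the page being scraped, which causes errors. Keep this in mind whenever you add headers yourself.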
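
If some custom headers are still needed, one possible workaround is to drop only the Host entry rather than every header. This is a sketch under the assumption that Host is the only problematic entry, as described above; it is not taken from the linked issue, and without_host is a hypothetical helper name.

```python
def without_host(headers):
    """Return a copy of 'headers' with any Host entry removed.

    Header names are compared case-insensitively, so 'Host', 'host'
    and 'HOST' are all dropped. The input dict is left unchanged.
    """
    return {name: value for name, value in headers.items()
            if name.lower() != "host"}


# Example input: headers like those one might keep in settings.py.
custom_headers = {
    "Host": "www.example.com",
    "Accept-Language": "en",
}
safe_headers = without_host(custom_headers)
```

Since SplashRequest is a subclass of scrapy.Request, the cleaned dictionary can be passed through its headers argument on a per-request basis.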

  
