A Scrapy framework tutorial
The Scrapy framework is genuinely powerful and well worth learning. Python itself values conciseness, so compared with Java you can build very powerful features with very little code. Python's syntax does have its annoyances, though, such as the lack of smart code completion, a common problem with dynamic languages. I've been learning it for less than a week; these are my notes. All practical content.
First, install the Scrapy framework. The IDE I use is PyCharm.
Create a Scrapy project named xxoo:
scrapy startproject xxoo
This gives you a project directory; look up what each part of it does. Then one more command creates a spider class from a template, a ready-made class where we only need to fill in the logic. Programmers are lazy by nature!!!
scrapy genspider [-t template] <name> <domain>
i.e.: scrapy genspider xxooSpider baidu.com
This creates a class named xxooSpider that crawls only baidu.com.
Debugging the project in PyCharm needs a little setup.
Create a start.py file in the project root. The -o itcast1.csv option writes the output to a CSV file; it can be omitted.
from scrapy import cmdline

cmdline.execute("scrapy crawl xxooSpider --nolog -o itcast1.csv".split())
That's it.
Installing packages from the Douban mirror:
pip install -i https://pypi.doubanio.com/simple/ scrapy-splash
Reading values from settings.py
from scrapy.conf import settings

cookie = settings['COOKIE']
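Note that scrapy.conf is the old-style API and is deprecated in newer Scrapy versions, where settings are reached through the crawler instead. A minimal sketch of the newer spider-side access (COOKIE being a hypothetical custom setting, as above):

import scrapy


class SettingsDemoSpider(scrapy.Spider):
    name = "settings_demo"

    def parse(self, response):
        # self.settings is the Settings object bound to the running crawler
        cookie = self.settings.get('COOKIE')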
Getting an image's URL
Experienced developers usually use this approach. Normally we can just take the src attribute directly, but sometimes its value lacks the site prefix, e.g. /img/1.png, and we would have to prepend http://www.baidu.com by hand. The method below solves this neatly.
from urllib import parse

url = "http://www.baidu.com/xx"
xx = "/pic/1/1.png"
urljoin = parse.urljoin(url, xx)
print(urljoin)
# prints: http://www.baidu.com/pic/1/1.png
Downloading images
Scrapy ships with a ready-made image download pipeline. We only need to register the pipeline middleware in settings and point it at the field to download. The field holding the image URLs must be a list, or it raises an error.
import os

ITEM_PIPELINES = {
    'xxoo.pipelines.XxooPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# The item field that holds the image URLs; ImagesPipeline downloads them
# automatically.
IMAGES_URLS_FIELD = "image"
# Where to store the images: the img folder under the project root.
IMAGES_STORE = os.path.join(os.path.abspath(os.path.dirname(__file__)), "img")
With the setup above, Scrapy does everything itself, generating folders and file names, which is hard to control. To customize this, we subclass ImagesPipeline and override a few methods:
from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request
import re


class ImagesrenamePipeline(ImagesPipeline):

    # 1. As the source shows, this method just iterates over the image URL
    #    field (a list) and issues a request for each URL.
    def get_media_requests(self, item, info):
        # Loop over every image URL to download; if a single value rather
        # than a collection is passed in, yield it directly without the loop.
        for image_url in item['imgurl']:
            # meta carries data from the spider down to file_path below.
            yield Request(image_url, meta={'name': item['imgname']})

    # 2. Rename the file. Without overriding this method the image name is a
    #    hash, i.e. an unreadable string.
    def file_path(self, request, response=None, info=None):
        # Take the last URL segment as the image name.
        image_guid = request.url.split('/')[-1]
        # Receive the image name passed through meta above.
        name = request.meta['name']
        # Strip characters Windows forbids in file names; without this step
        # you get garbled names or failed downloads.
        name = re.sub(r'[?\\*|"<>:/]', '', name)
        # The key to per-folder storage: {0} is name, {1} is image_guid.
        filename = u'{0}/{1}'.format(name, image_guid)
        return filename

    # 3. Called after the requests complete; results holds the request URLs
    #    and the storage paths.
    def item_completed(self, results, item, info):
        return item  # pass the item on to any later pipelines
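For this pipeline to work, the item must carry the imgurl and imgname fields used above; a minimal sketch (the class name ImgItem is made up):

import scrapy


class ImgItem(scrapy.Item):
    imgurl = scrapy.Field()   # must hold a list of image URLs
    imgname = scrapy.Field()  # used by file_path above as the folder name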
Saving items to a JSON file
A hand-rolled version:
import codecs
import json


class jsonwrite(object):
    # Open the file on initialization.
    def __init__(self):
        self.file = codecs.open("xxoo.json", "w", encoding="utf-8")

    # Scrapy calls this method to write each item.
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    # Close the file when the spider finishes (Scrapy calls close_spider on
    # pipelines, so that is the name this method must have).
    def close_spider(self, spider):
        self.file.close()
The one Scrapy provides:
from scrapy.exporters import JsonItemExporter


class JsonExporterPipleline(object):
    # Use the JsonItemExporter that Scrapy provides to export a JSON file.
    def __init__(self):
        self.file = open('articleexport.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
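Neither pipeline runs until it is registered in settings.py; a sketch, assuming the class lives in the project's pipelines module (the xxoo.pipelines path is an assumption):

ITEM_PIPELINES = {
    # lower numbers run earlier
    'xxoo.pipelines.JsonExporterPipleline': 2,
}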
Saving to MySQL (two methods)
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


# Method 1: write to MySQL synchronously.
class MysqlPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect('192.168.0.106', 'root', 'root', 'article_spider',
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"],
                                         item["create_date"], item["fav_nums"]))
        self.conn.commit()
        return item


# Method 2: use an asynchronous database connection pool.
class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to make the MySQL insert asynchronous.
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert.
        print(failure)

    def do_insert(self, cursor, item):
        # Run the actual insert; each item type builds its own SQL
        # (see the get_insert_sql sketch below).
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
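The async pipeline calls item.get_insert_sql(), which this post never shows; a plausible sketch of how such an item class might look:

import scrapy


class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    create_date = scrapy.Field()
    fav_nums = scrapy.Field()

    def get_insert_sql(self):
        # Each item class supplies its own SQL and parameters, so one
        # pipeline can persist many different item types.
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        params = (self["title"], self["url"], self["create_date"], self["fav_nums"])
        return insert_sql, params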
Optimizing the item class (important)
We could parse the page with XPath or CSS and then write the judgment logic by hand, if you don't mind the hassle.
Scrapy provides a complete workflow that keeps the code very lean: the item's business logic lives in the item class, and the spider file only declares how each item field is produced.
First, the item class:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join


# A small trick to override the default output processor: TakeFirst()
# collapses the extracted list into a single string, but for some fields we
# want to keep the list, so we pass the value through unchanged.
def return_value(value):
    return value


# Everything the stock ItemLoader produces is a list, so we set a default
# output processor once (it can still be overridden per field) instead of
# repeating the same code on every field.
class ArticleItemLoader(ItemLoader):
    # Custom ItemLoader.
    default_output_processor = TakeFirst()


# The custom item class. input_processor holds the per-field business logic,
# such as format conversions; output_processor overrides the default rule.
# date_convert, get_nums and remove_comment_tags are custom processors
# (sketched after this block).
class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert),
    )
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tags),
        output_processor=Join(",")
    )
    content = scrapy.Field()
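date_convert, get_nums and remove_comment_tags are not defined in this post; plausible sketches, assuming they do what the field names suggest:

import re
import datetime


def date_convert(value):
    # Parse a date string such as "2017/01/01"; fall back to today on failure.
    try:
        return datetime.datetime.strptime(value.strip(), "%Y/%m/%d").date()
    except ValueError:
        return datetime.datetime.now().date()


def get_nums(value):
    # Pull the first integer out of text like "3 收藏"; default to 0.
    match = re.match(r".*?(\d+).*", value)
    return int(match.group(1)) if match else 0


def remove_comment_tags(value):
    # Drop the "评论" pseudo-tag so it does not end up in the tags field.
    return "" if "评论" in value else value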
The spider class:
from scrapy.loader import ItemLoader
from ArticleSpider.items import JobBoleArticleItem, ArticleItemLoader


# A spider method: load the item through the ItemLoader.
def parse_detail(self, response):
    front_image_url = response.meta.get("front_image_url", "")  # article cover image
    item_loader = ArticleItemLoader(item=JobBoleArticleItem(), response=response)
    item_loader.add_css("title", ".entry-header h1::text")
    item_loader.add_value("url", response.url)
    item_loader.add_value("url_object_id", get_md5(response.url))  # get_md5: see sketch below
    item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
    item_loader.add_value("front_image_url", [front_image_url])
    item_loader.add_css("praise_nums", ".vote-post-up h10::text")
    item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
    item_loader.add_css("fav_nums", ".bookmark-btn::text")
    item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
    item_loader.add_css("content", "div.entry")

    article_item = item_loader.load_item()
    yield article_item
Extracting all URLs from a page
We could of course get them with XPath, but there is a leaner way: the LinkExtractor class makes this very simple.
from scrapy.linkextractors import LinkExtractor

# Build an instance with a matching rule, then hand a response to
# extract_links to get every URL on the page that matches.
link = LinkExtractor(allow=r'http://lab.scrapyd.cn')
# link = LinkExtractor(allow=r'http://lab.scrapyd.cn/archives/\d+.html')
links = link.extract_links(response)
if links:
    for link_one in links:
        print(link_one)
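LinkExtractor is also the building block of CrawlSpider rules, which follow matching links automatically; a minimal sketch (the spider name and callback are made up):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LabSpider(CrawlSpider):
    name = "lab"
    start_urls = ['http://lab.scrapyd.cn']
    rules = (
        # Follow every archive link and hand each response to parse_item.
        Rule(LinkExtractor(allow=r'/archives/\d+\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)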
Using logs
Scrapy provides logging through Python's logging module.
Add the following two lines anywhere in settings.py and the output becomes much cleaner:
LOG_FILE = "TencentSpider.log" LOG_LEVEL = "INFO"
Log levels
Scrapy provides five logging levels:
CRITICAL - critical errors
ERROR - regular errors
WARNING - warning messages
INFO - informational messages
DEBUG - debugging messages
Logging settings
The following settings in settings.py configure logging:
LOG_ENABLED (default: True): enables logging
LOG_ENCODING (default: 'utf-8'): the encoding used for logging
LOG_FILE (default: None): file name for the log output, created in the current directory
LOG_LEVEL (default: 'DEBUG'): the minimum level to log
LOG_STDOUT (default: False): if True, all standard output (and errors) of the process is redirected to the log, so e.g. print("hello") will show up in the Scrapy log
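Inside a spider, messages can be written to the same log through the spider's built-in logger; a quick sketch:

import scrapy


class TencentSpider(scrapy.Spider):
    name = "tencent"

    def parse(self, response):
        # self.logger is a standard logging.Logger named after the spider
        self.logger.info("parsed %s", response.url)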
Saving to a MongoDB database
import pymongo
from scrapy.conf import settings


class DoubanPipeline(object):
    def __init__(self):
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DBNAME"]
        sheetname = settings["MONGODB_SHEETNAME"]
        # Create the MongoDB connection.
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database.
        mydb = client[dbname]
        # The collection that stores the data.
        self.sheet = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.sheet.insert(data)  # insert_one(data) in pymongo 3+
        return item
The settings file:
# MongoDB host
MONGODB_HOST = "127.0.0.1"
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = "Douban"
# name of the collection that stores the data
MONGODB_SHEETNAME = "doubanmovies"
Downloader middleware: rotating the User-Agent and proxy IP
import random
import base64

from settings import USER_AGENTS
from settings import PROXIES


# Pick a random User-Agent for every request.
class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(USER_AGENTS)
        # print(useragent)
        request.headers.setdefault("User-Agent", useragent)


# Pick a random proxy for every request.
class RandomProxy(object):
    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)

        if proxy['user_passwd'] is None:
            # Proxy that needs no account verification.
            request.meta['proxy'] = "http://" + proxy['ip_port']
        else:
            # Base64-encode the account:password pair...
            base64_userpasswd = base64.b64encode(proxy['user_passwd'].encode()).decode()
            # ...and pass it in the header format the proxy server expects.
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_userpasswd
            request.meta['proxy'] = "http://" + proxy['ip_port']
The settings file:
USER_AGENTS = [
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2)',
    'Opera/9.27 (Windows NT 5.2; U; zh-cn)',
    'Opera/8.0 (Macintosh; PPC Mac OS X; U; en)',
    'Mozilla/5.0 (Macintosh; PPC Mac OS X; U; en) Opera 8.0',
    'Mozilla/5.0 (Linux; U; Android 4.0.3; zh-cn; M032 Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13'
]

PROXIES = [
    {"ip_port": "121.42.140.113:16816", "user_passwd": "mr_mao_hacker:sffqry9r"},
    # {"ip_port": "121.42.140.113:16816", "user_passwd": ""},
    # {"ip_port": "121.42.140.113:16816", "user_passwd": ""},
    # {"ip_port": "121.42.140.113:16816", "user_passwd": ""},
]
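The two middlewares only take effect once enabled in settings.py; a sketch, assuming they live in the project's middlewares module (the xxoo.middlewares path is an assumption):

DOWNLOADER_MIDDLEWARES = {
    'xxoo.middlewares.RandomUserAgent': 543,
    'xxoo.middlewares.RandomProxy': 544,
}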
Three ways to log in
1. The simplest: find the login endpoint and submit the account and password directly.
2. Sometimes hidden values must be pulled from the login page and submitted along with the form; Zhihu, for example, requires the _xsrf value from its login page.
3. The most troublesome: the site piles on encryption and verification, so we log in with cookies instead.
Three reference implementations:
1. Easy
# -*- coding: utf-8 -*-
import scrapy


# Any login that only needs POST data can use this approach; here the POST
# data is the account and password.
class Renren1Spider(scrapy.Spider):
    name = "renren1"
    allowed_domains = ["renren.com"]

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'
        yield scrapy.FormRequest(
            url=url,
            formdata={"email": "mr_mao_hacker@163.com", "password": "alarmchime"},
            callback=self.parse_page)

    def parse_page(self, response):
        with open("mao2.html", "wb") as filename:
            filename.write(response.body)
2. Intermediate
# -*- coding: utf-8 -*-
import scrapy


# The orthodox way to simulate a login:
# first GET the login page and extract the parameters the login requires
# (e.g. Zhihu's _xsrf), then POST them to the server together with the
# account and password.
class Renren2Spider(scrapy.Spider):
    name = "renren2"
    allowed_domains = ["renren.com"]
    start_urls = (
        "http://www.renren.com/PLogin.do",
    )

    def parse(self, response):
        # _xsrf = response.xpath("//_xsrf").extract()[0]
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"email": "mr_mao_hacker@163.com", "password": "alarmchime"},
            # formdata={..., "_xsrf": _xsrf},
            callback=self.parse_page
        )

    def parse_page(self, response):
        print("=========1===" + response.url)
        # with open("mao.html", "wb") as filename:
        #     filename.write(response.body)
        url = "http://www.renren.com/422167102/profile"
        yield scrapy.Request(url, callback=self.parse_newpage)

    def parse_newpage(self, response):
        print("===========2====" + response.url)
        with open("xiao.html", "wb") as filename:
            filename.write(response.body)
3. Hard
# -*- coding: utf-8 -*-
import scrapy


# When nothing else works, simulate the login with cookies. More tedious,
# but the success rate is 100%.
class RenrenSpider(scrapy.Spider):
    name = "renren"
    allowed_domains = ["renren.com"]
    start_urls = (
        'http://www.renren.com/xxxxx',
        'http://www.renren.com/11111',
        'http://www.renren.com/xx',
    )

    cookies = {
        "anonymid": "ixrna3fysufnwv",
        "_r01_": "1",
        "ap": "327550029",
        "JSESSIONID": "abciwg61A_RvtaRS3GjOv",
        "depovince": "GW",
        "springskin": "set",
        "jebe_key": "f6fb270b-d06d-42e6-8b53-e67c3156aa7e%7Cc13c37f53bca9e1e7132d4b58ce00fa3%7C1484060607478%7C1%7C1486198619601",
        "ver": "7.0",
        "XNESSESSIONID": "e703b11f8809",
        "jebecookies": "98c7c881-779f-4da8-a57c-7464175cd469|||||",
        "ick_login": "4b4a254a-9f25-4d4a-b686-a41fda73e173",
        "_de": "BF09EE3A28DED52E6B65F6A4705D973F1383380866D39FF5",
        "p": "ea5541736f993365a23d04c0946c10e29",
        "first_login_flag": "1",
        "ln_uact": "mr_mao_hacker@163.com",
        "ln_hurl": "http://hdn.xnimg.cn/photos/hdn521/20140529/1055/h_main_9A3Z_e0c300019f6a195a.jpg",
        "t": "691808127750a83d33704a565d8340ae9",
        "societyguester": "691808127750a83d33704a565d8340ae9",
        "id": "327550029",
        "xnsid": "f42b25cf",
        "loginfrom": "syshome",
    }

    def start_requests(self):
        for url in self.start_urls:
            # yield scrapy.Request(url, callback=self.parse)
            # url = "http://www.renren.com/410043129/profile"
            yield scrapy.FormRequest(url, cookies=self.cookies, callback=self.parse_page)

    def parse_page(self, response):
        print("===========" + response.url)
        with open("deng.html", "wb") as filename:
            filename.write(response.body)