Web Spider NEX XX International Money Broking - PDF Download & Keyword Extraction (Part 2)

03-09

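Web Spider NEX XX International Money Broking - PDF Download & Parsing

Disclaimer: this example is for learning and discussion only; do not use it for any illegal purpose.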


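Contents

Web Spider NEX XX International Money Broking - PDF Download & Parsing
Preface
I. Task description

1. PDF download
2. Parsing the PDF and extracting keyword data
II. Installing modules with pip
III. Site analysis
IV. Core code with comments

1. Build the date strings from January 1, 2019 to today and store them in a list
2. PDF download
3. Reading and parsing the PDF
V. Results
VI. Sample code

Summary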

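Preface

Target site: https://www.cfets-nex.com.cn/

Note: the main content of the article follows; the example below is for reference.

I. Task description

1. PDF download

Note: download the "Interbank Money Market" PDF files from January 1, 2019 to today.

Page URL: https://www.cfets-nex.com.cn/Market/marketOverview/dailyReview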


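2. Parsing the PDF and extracting keyword data

What to extract

Extract the content at the positions marked with red boxes in the PDF; if a field is not present, assign "None".
Red box 1: starts with "今日资金面" and ends at the full stop (。).
Red box 2: starts with "资金面情绪指数" and ends at the line break (\n).

These are the main parts to extract. Some PDFs open with different keywords and need a few extra matching rules; see the example.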
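
The two rules above can be sketched with regular expressions. This is only a minimal sketch: the function name extract_keywords, the dictionary keys, and the sample text are made up for illustration, and real PDFs with different opening keywords would need extra patterns.

```python
import re

# Rule 1: starts with "今日资金面", ends at the first full stop (。).
# [^。]* also matches line breaks, which PDF text extraction may insert mid-sentence.
FUNDING_RE = re.compile(r"今日资金面[^。]*。")
# Rule 2: starts with "资金面情绪指数", ends at the line break.
SENTIMENT_RE = re.compile(r"资金面情绪指数[^\n]*")


def extract_keywords(pdf_text: str) -> dict:
    """Return both fields, or the string "None" for a field that is absent."""
    funding = FUNDING_RE.search(pdf_text)
    sentiment = SENTIMENT_RE.search(pdf_text)
    return {
        "funding": funding.group(0) if funding else "None",
        "sentiment": sentiment.group(0).strip() if sentiment else "None",
    }


# Made-up sample text, not taken from a real report.
sample = "市场概述\n今日资金面整体均衡，隔夜需求较多。其他内容\n资金面情绪指数：50\n尾部"
print(extract_keywords(sample))
```

Returning the string "None" rather than Python's None follows the task description above.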

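Note: if you know a better way to extract this, leave a comment or send me a private message so we can improve together in the IT community. Thanks!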

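II. Installing modules with pip

Mirror addresses

Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/

Modules used in this example and their versions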


pandas==1.1.3
PyPDF2==2.12.1

requests==2.27.0

Install a single module with pip: pip install <module-name> -i https://pypi.tuna.tsinghua.edu.cn/simple

Install from a requirements.txt file with pip: pip install -i https://pypi.doubanio.com/simple/ -r requirements.txt

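III. Site analysis

1. Open the link and a pattern emerges: the URL of each day's closing review ends with the corresponding date string.

Date string at the end of the link: 2022/12/15

Date string at the end of the link: 2022/12/16

2. Press F12 to open the developer tools. The PDF link is visible directly in the page source, so a plain request to the page is all that is needed.

href of the <a> tag: /Cms_Data/Contents/Site2019/Folders/Daily/~contents/XBVJCVJ4Q8QG9A9L/MM.pdf

From experience, the site prefix has to be added in front: https://www.cfets-nex.com.cn

The combined URL opens the PDF directly: https://www.cfets-nex.com.cn/Cms_Data/Contents/Site2019/Folders/Daily/~contents/XBVJCVJ4Q8QG9A9L/MM.pdf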
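
The prefixing step can be sketched with urllib.parse.urljoin from the standard library. The regular expression, the function name find_pdf_url, and the sample HTML are assumptions for illustration; the real page markup may differ, so treat the pattern as a starting point.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.cfets-nex.com.cn"

# Assumed pattern: an href under /Cms_Data/ whose file name is MM.pdf.
PDF_HREF_RE = re.compile(r'href="(/Cms_Data/[^"]+/MM\.pdf)"')


def find_pdf_url(page_html: str):
    """Return the absolute PDF URL found in the page, or None if there is none."""
    match = PDF_HREF_RE.search(page_html)
    if match is None:
        return None
    # urljoin prepends the scheme and host to the site-relative href.
    return urljoin(BASE_URL, match.group(1))


# Sample markup built from the href shown above.
sample_html = (
    '<a href="/Cms_Data/Contents/Site2019/Folders/Daily/'
    '~contents/XBVJCVJ4Q8QG9A9L/MM.pdf">MM</a>'
)
print(find_pdf_url(sample_html))
```

Returning None when no link is found gives the caller an easy way to skip dates that have no report.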

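IV. Core code with comments

1. Build the date strings from January 1, 2019 to today and store them in a list

import datetime

start_string = '2019-01-01'


def create_date_list():
    start_date = datetime.datetime.strptime(start_string, "%Y-%m-%d")  # parse the start string into a datetime
    now_date = datetime.datetime.now().strftime("%Y-%m-%d")  # today's date as a string
    date_string_list = list()
    i = 0
    while True:
        date_i = (start_date + datetime.timedelta(days=i)).strftime('%Y-%m-%d')
        date_string = str(date_i).replace('-', '/')
        print("Date string created and stored:", date_string)
        date_string_list.append(date_string)
        # The source listing is cut off after "if date_i"; the ending below is an
        # assumption: stop once today's date has been reached.
        if date_i >= now_date:
            break
        i += 1
    return date_string_list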
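
The same list can also be built without a while loop. The sketch below is a variant under my own assumptions, not the original listing: the function name date_strings and the explicit end argument are made up, the latter so that the result is easy to check.

```python
import datetime


def date_strings(start: datetime.date, end: datetime.date) -> list:
    """Return every date from start to end (inclusive) formatted as YYYY/MM/DD."""
    days = (end - start).days
    # range(days + 1) is empty when end is earlier than start, so the result is [].
    return [
        (start + datetime.timedelta(days=i)).strftime("%Y/%m/%d")
        for i in range(days + 1)
    ]


dates = date_strings(datetime.date(2019, 1, 1), datetime.date(2019, 1, 3))
print(dates)  # ['2019/01/01', '2019/01/02', '2019/01/03']
```

Passing datetime.date.today() as end reproduces the "up to today" behaviour described above.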

2. PDF download

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
}


def pdf_download():
    file_path = "result.pdf"
    pdf_url = 'https://xxxx.pdf'  # placeholder; substitute the PDF link found on the page
    response = requests.get(url=pdf_url, headers=headers, timeout=5, stream=True)
    response.raise_for_status()  # fail early on HTTP errors instead of saving an error page
    with open(file_path, 'wb') as fis:
        for chunk in response.iter_content(chunk_size=1000):  # write the body in 1000-byte chunks
            fis.write(chunk)
    print(f'Download complete: {file_path}')
    return True
 
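3. Reading and parsing the PDF

import PyPDF2

file_path = "result.pdf"  # the file saved by pdf_download()

# PdfReader, pages and extract_text are the non-deprecated names in PyPDF2 2.x
# for PdfFileReader, numPages/getPage and extractText.
with open(file=file_path, mode='rb') as pdffile:  # open the PDF in binary mode
    pdfreader = PyPDF2.PdfReader(pdffile)
    pdf_content = ''
    for page in pdfreader.pages:  # iterate over every page object
        pdf_content += page.extract_text()  # text of the page, as a string
parse(pdf_content)  # parse() is a function you define yourself to pull out what you need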
 
V. Results
 
VI. Sample code

import os
import re
import time
import PyPDF2
import datetime
import requests
import pandas as pd
from requests import exceptions as request_exceptions


class SHICEconomy(object):
    def __init__(self):
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/95.0.4638.69 Safari/537.36'
        }
        self.start_string = '2019-01-01'
        self.resource_path = 'resource'
        self.result_file_path = 'result.csv'

    def create_date_list(self):
        start_date = datetime.datetime.strptime(self.start_string, "%Y-%m-%d")  # parse the start string into a datetime
        now_date = datetime.datetime.now().strftime("%Y-%m-%d")  # today's date as a string
        date_string_list = list()
        i = 0
        while True:
            date_i = (start_date + datetime.timedelta(days=i)).strftime('%Y-%m-%d')
            date_string = str(date_i).replace('-', '/')
            print("Date string created and stored:", date_string)
            date_string_list.append(date_string)
            # The source listing is cut off after "if date_i"; the ending of this
            # method is an assumption, and the rest of the class is not available.
            if date_i >= now_date:
                break
            i += 1
        return date_string_list

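Summary

This example is for learning and discussion only. If it infringes on the site's interests in any way, please contact me promptly and I will take the post down.
I'll close with two open questions. Feel free to discuss them in the comments or send me a private message. Thanks in advance!
Question 1: after fetching the PDF link with requests, how can the binary content be fed straight into a PDF parsing module?
Question 2: what is a better way to extract the keyword content from the PDF?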
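
For Question 1, one possible approach is to wrap the downloaded bytes in io.BytesIO and hand that to the reader, since PyPDF2's PdfReader accepts a file-like object as well as a path. The sketch below assumes PyPDF2 2.x; the helper names looks_like_pdf and pdf_bytes_to_text are made up, and the header check is an extra precaution of mine in case a date has no report and the server returns something other than a PDF.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF magic header."""
    return data.lstrip().startswith(b"%PDF-")


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract text from in-memory PDF bytes without writing a file."""
    if not looks_like_pdf(data):
        raise ValueError("response body is not a PDF")
    from PyPDF2 import PdfReader  # imported lazily; assumes PyPDF2 2.x is installed
    reader = PdfReader(io.BytesIO(data))
    # extract_text() may return an empty value for pages with no text layer.
    return "".join(page.extract_text() or "" for page in reader.pages)


# Usage sketch (not executed here; pdf_url and headers as in section IV):
# response = requests.get(pdf_url, headers=headers, timeout=5)
# pdf_content = pdf_bytes_to_text(response.content)
```

This skips the intermediate result.pdf file, at the cost of holding the whole PDF in memory.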