
Crawling Site Content with Python to Serve as a Static Fallback

Preface

Because of how this site is deployed (cloud service + local deployment), the local part can go offline for any number of reasons. To keep the site viewable when that happens, the site's content is crawled and saved as static files; Nginx's error-status capture then returns the static file matching the request URI to the browser whenever the backend service is unavailable.

Crawling the Pages with Python

The spider below uses Python's Scrapy framework. Following the official Scrapy Tutorial, the scrapy startproject tutorial command quickly scaffolds a crawler project; after creating your own spider file and writing the logic, start it with scrapy crawl crawlTest. A reference spider:
from pathlib import Path

import scrapy
import logging
from typing import Any

# Start the spider from the command line: scrapy crawl crawlTest
class QuotesSpider(scrapy.Spider):
    # Adjust the variables below
    # Spider name
    name = "crawlTest"
    # Start URL(s) for the crawl
    start_urls = [
        "https://xxx.xxx.com",
    ]
    # Host whose resources are downloaded
    downloadHost = "xxx.xxx.com"
    # Save directory
    saveDir = "blog"

    # URLs already visited
    accessedList = []
    # URLs already downloaded
    downloadedList = []

    def parse(self, response):
        self.log("parse: visiting " + response.url)
        # Record the page as visited
        self._addAccess(response.url)
        # Save the current page
        self.download(response)

        # Download resources
        # Images
        images = response.css("img::attr(src)").getall()
        self.log("images: " + str(images))
        for s_item in images:
            if self.downloadHost in s_item and not self._isDownloaded(s_item):
                yield response.follow(s_item, self.download)
        # Stylesheets (<link href=...>)
        links = response.css("link::attr(href)").getall()
        self.log("links: " + str(links))
        for item in links:
            # Strip the query string from the URL
            s_item = self._removeParameter(item)
            if self.downloadHost in s_item and not self._isDownloaded(s_item):
                yield response.follow(s_item, self.download)
        # Scripts
        scripts = response.css("script::attr(src)").getall()
        self.log("scripts: " + str(scripts))
        for item in scripts:
            # Strip the query string from the URL
            s_item = self._removeParameter(item)
            if self.downloadHost in s_item and not self._isDownloaded(s_item):
                yield response.follow(s_item, self.download)

        # Follow links to other pages on this site
        aList = response.css("a::attr(href)").getall()
        for stra in aList:
            if self.downloadHost in stra:
                if self._isAccessed(stra):
                    self.log("already visited on-site URL: " + stra)
                    continue
                self.log("downloadable on-site URL: " + stra)
            else:
                self.log("off-site URL, skipped: " + stra)
                continue
            yield response.follow(stra, self.parse)

    def download(self, response):
        self._download(response.url, response.body)

    def _download(self, url, body):
        if self._isDownloadedAndSave(url):
            return
        # e.g. https://xxx.xxx.com/xx/xx/
        url = self._switchHtml(url)

        # Everything after "https://"
        paths = url.split("/")[2:]
        if paths[0] == self.downloadHost:
            self.log("downloading page: " + url)
        else:
            return

        # Drop the host segment
        paths = paths[1:]

        for index, name in enumerate(paths):
            path = self._getMainDir() + "/" + "/".join(paths[0:index + 1])
            file = Path(path)
            if index == len(paths) - 1:
                # Last segment: write the file itself
                file.write_bytes(body)
            else:
                # Intermediate segment: create the directory if needed
                if file.exists():
                    continue
                file.mkdir()

    def _getMainDir(self):
        dir = Path(self.saveDir)
        if not dir.exists():
            dir.mkdir()
        return self.saveDir

    def _switchHtml(self, url):
        # URLs ending with "/" or with the bare host are pages
        if url.endswith("/"):
            url = url + "index.html"
        elif url.endswith(self.downloadHost):
            url = url + "/index.html"
        return url

    def _isDownloaded(self, url):
        return url in self.downloadedList

    def _isDownloadedAndSave(self, url):
        if self._isDownloaded(url):
            return True
        self.downloadedList.append(url)
        return False

    def _isAccessed(self, url):
        return url in self.accessedList

    def _addAccess(self, url):
        if not self._isAccessed(url):
            self.accessedList.append(url)

    def _removeParameter(self, url):
        # Strip the query string: keep everything before "?"
        return url.split("?")[0]

    def log(self, message: Any, level: int = logging.DEBUG, **kw: Any) -> None:
        # Override Scrapy's log so messages always reach stdout
        print(message)


if __name__ == '__main__':
    # Quick local checks of the helper methods
    q = QuotesSpider()
    # q._download(q.downloadHost, b"aa")
    # print(q._isDownloadedAndSave(q.downloadHost))
    # print(q._removeParameter(q.downloadHost + "/wp-content/plugins/enlighter/cache/enlighterjs.min.css?ver=vo/Yz0k1HSy0Sr5"))
    print((q.downloadHost + "/2024/04/28/wordpress-custome-css-define-website/").split("/"))
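To make the page-to-file mapping concrete, here is a minimal standalone sketch (the helper name url_to_local_path is hypothetical, not part of the spider) that mirrors the _switchHtml and path-splitting logic in _download:

```python
from typing import Optional

def url_to_local_path(url: str, host: str, save_dir: str) -> Optional[str]:
    """Map a crawled URL to the local path the spider would save it under."""
    # Same rule as _switchHtml: directory-style URLs become index.html pages
    if url.endswith("/"):
        url += "index.html"
    elif url.endswith(host):
        url += "/index.html"
    parts = url.split("/")[2:]          # drop the "https:" scheme and "//"
    if not parts or parts[0] != host:   # off-site URLs are not saved
        return None
    return save_dir + "/" + "/".join(parts[1:])

print(url_to_local_path("https://xxx.xxx.com/2024/04/28/post/", "xxx.xxx.com", "blog"))
# -> blog/2024/04/28/post/index.html
```

Off-site URLs map to None, and the bare host maps to blog/index.html, which is exactly what lets Nginx later serve the GitHub mirror by URI.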

Pushing to a GitHub Repository

1. Create the repository
Log in to GitHub and create a repository, then enable automatic Pages deployment: Settings -> Code and automation -> Pages -> Build and deployment -> select the relevant branch under Branch. Reference: https://docs.github.com/en/pages/quickstart
PS: if you want the site to be your account's home page, name the repository "username.github.io".
2. Push with Git
Push the crawled files to GitHub:
[root@iZbp1605iwejf5qgem2c7hZ ~]# cd blog
[root@iZbp1605iwejf5qgem2c7hZ blog]# git init
[root@iZbp1605iwejf5qgem2c7hZ blog]# git add .
[root@iZbp1605iwejf5qgem2c7hZ blog]# git commit -m init
[root@iZbp1605iwejf5qgem2c7hZ blog]# git remote add origin https://github.com/xxxxxx/xxxxxx.github.io.git
[root@iZbp1605iwejf5qgem2c7hZ blog]# git checkout -b main
[root@iZbp1605iwejf5qgem2c7hZ blog]# git push -u origin main

Ingress Fallback to the Static Site

1. Configure the Ingress to fall back to the GitHub mirror when the service is unavailable:
# Ingress server-snippets annotation:
    nginx.org/server-snippets: |
      error_page 502 = @fallback;
      location @fallback {
        proxy_pass https://xxxxxx.github.io;
      }
Note: a misconfigured snippet can break networking for the whole cluster! The domain in proxy_pass must be resolvable at all times, so pointing it at a Kubernetes Service name is not recommended.
Reference: the earlier post on an Nginx-Ingress-Controller snippets misconfiguration causing a cluster network failure.
2. Run the Nginx Ingress Controller on the control-plane node, so that a failing worker node cannot take it down.
2.1 Check the node taints:
[root@iZbp1605iwejf5qgem2c7hZ blog]# kubectl describe node|grep role -n
8:                    node-role.kubernetes.io/control-plane=
18:Taints:             node-role.kubernetes.io/control-plane:NoSchedule
2.2 Add a toleration to the Nginx-Ingress-Controller Deployment:
# goes under spec.template.spec
    tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Equal"
      value: ""
      effect: "NoSchedule"
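A toleration only permits scheduling onto the tainted control-plane node; it does not force the pod there. To actually pin the controller, a nodeSelector can be added in the same spec.template.spec block; a hedged sketch, assuming the standard kubeadm label shown in the kubectl output above:

```
# Hypothetical addition under spec.template.spec, next to the tolerations
nodeSelector:
  node-role.kubernetes.io/control-plane: ""
```

With both in place the controller is scheduled on the control plane even when worker nodes are available.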
Done! When the worker node is lost, the Nginx Ingress Controller returns a 502 gateway error, which triggers the fallback and serves the file matching the URI from GitHub to the browser. Page loads are somewhat slow though, possibly due to GitHub rate limiting: a single file loads fine, but a typical page pulls in dozens of files, and speed only improves once the browser has cached some of them.

The process was later fully automated, with the static site deployed on the control-plane node, which is much faster; see the follow-up article: Fully Automated Static Site Publishing with GitHub Actions.
