Crawling a Site with Python to Serve Static Fallback Pages
Preface
Because of how this site is deployed (cloud server + local machine), the local side can go offline for any number of reasons. So that the site stays reachable in that case, its content is crawled and saved as static files; Nginx's error-status handling then returns the static file matching the request URI to the browser whenever the backend service is unavailable.
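To make the mechanism concrete, here is a minimal sketch in plain Nginx terms, assuming a hypothetical upstream name and mirror path (the setup described later in this post does the same thing through an Ingress annotation):

# Minimal sketch; "blog_backend" and /var/www/blog-mirror are placeholders.
server {
    listen 80;

    location / {
        proxy_pass http://blog_backend;
        # When Nginx itself generates a 502/504 (backend unreachable),
        # hand the request over to the fallback location.
        error_page 502 504 = @static_fallback;
    }

    location @static_fallback {
        root /var/www/blog-mirror;           # the crawled static copy
        try_files $uri $uri/index.html =404;
    }
}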
Crawling Pages with Python
The crawler below uses Python's Scrapy framework. Following the official Scrapy Tutorial, scrapy startproject tutorial scaffolds a crawler project; after creating your own spider file and writing its logic, start the spider with scrapy crawl followed by the spider's name attribute (crawlTest here). A reference spider:
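For reference, the end-to-end commands might look like this; the spider file name is illustrative, and the crawl command must match the spider's name attribute:

pip install scrapy
scrapy startproject tutorial
cd tutorial
# save the spider below as tutorial/spiders/crawl_test.py, then run it
scrapy crawl crawlTest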
from pathlib import Path
import logging
from typing import Any

import scrapy


# Start the spider from the command line: scrapy crawl crawlTest
class QuotesSpider(scrapy.Spider):
    # Adjust the variables below
    # Spider name
    name = "crawlTest"
    # Crawl start URL
    start_urls = [
        "https://xxx.xxx.com",
    ]
    # Host whose resources get downloaded
    downloadHost = "xxx.xxx.com"
    # Save directory
    saveDir = "blog"
    # URLs already visited
    accessedList = []
    # URLs already downloaded
    downloadedList = []

    def parse(self, response):
        self.log("parse: " + response.url)
        # Record the URL as visited
        self._addAccess(response.url)
        # Save the current page
        self.download(response)

        # Resource downloads
        # Images
        images = response.css("img::attr(src)").getall()
        self.log("images: " + str(images))
        for s_item in images:
            if self.downloadHost in s_item and not self._isDownloaded(s_item):
                yield response.follow(s_item, self.download)

        # Stylesheets (link tags)
        links = response.css("link::attr(href)").getall()
        self.log("links: " + str(links))
        for item in links:
            # Strip the query string from the URL
            s_item = self._removeParameter(item)
            if self.downloadHost in s_item and not self._isDownloaded(s_item):
                yield response.follow(s_item, self.download)

        # Scripts
        scripts = response.css("script::attr(src)").getall()
        self.log("scripts: " + str(scripts))
        for item in scripts:
            # Strip the query string from the URL
            s_item = self._removeParameter(item)
            if self.downloadHost in s_item and not self._isDownloaded(s_item):
                yield response.follow(s_item, self.download)

        # Crawl other pages
        aList = response.css("a::attr(href)").getall()
        self.log("aList: " + str(aList))
        for stra in aList:
            if self.downloadHost in stra:
                if self._isAccessed(stra):
                    self.log("Same-site URL, already visited: " + stra)
                    continue
                self.log("Same-site URL, will crawl: " + stra)
            else:
                self.log("URL not eligible for download: " + stra)
                continue
            yield response.follow(stra, self.parse)

    def download(self, response):
        self._download(response.url, response.body)

    def _download(self, url, body):
        if self._isDownloadedAndSave(url):
            return
        # e.g. https://xxx.xxx.com/xx/xx/
        url = self._switchHtml(url)
        # Everything after "https://"
        paths = url.split("/")[2:]
        if paths[0] == self.downloadHost:
            self.log("Downloading page: " + url)
        else:
            return
        # Drop the host
        paths = paths[1:]
        for index in range(len(paths)):
            path = self._getMainDir() + "/" + "/".join(paths[0:index + 1])
            file = Path(path)
            if index == len(paths) - 1:
                # The last path segment is the file itself
                file.write_bytes(body)
            else:
                # Intermediate segments are directories
                if file.exists():
                    continue
                file.mkdir()

    def _getMainDir(self):
        dir = Path(self.saveDir)
        if not dir.exists():
            dir.mkdir()
        return self.saveDir

    def _switchHtml(self, url):
        # URLs ending with "/" or with the bare host are pages
        if url.endswith("/"):
            url = url + "index.html"
        elif url.endswith(self.downloadHost):
            url = url + "/index.html"
        return url

    def _isDownloaded(self, url):
        return url in self.downloadedList

    def _isDownloadedAndSave(self, url):
        if self._isDownloaded(url):
            return True
        self.downloadedList.append(url)
        return False

    def _isAccessed(self, url):
        return url in self.accessedList

    def _addAccess(self, url):
        if not self._isAccessed(url):
            self.accessedList.append(url)

    def _removeParameter(self, url):
        return url.split("?")[0]

    def log(self, message: Any, level: int = logging.DEBUG, **kw: Any) -> None:
        # Override Spider.log so messages go straight to stdout
        print(message)


if __name__ == '__main__':
    q = QuotesSpider()
    # q._download(q.downloadHost, b"aa")
    # print(q._isDownloadedAndSave(q.downloadHost))
    # print(q._removeParameter(q.downloadHost + "/wp-content/plugins/enlighter/cache/enlighterjs.min.css?ver=vo/Yz0k1HSy0Sr5"))
    print((q.downloadHost + "/2024/04/28/wordpress-custome-css-define-website/").split("/"))
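After a run, the save directory mirrors the site's URL structure, with each page URL ending in "/" mapped to a directory containing an index.html. Roughly (paths illustrative):

blog/
├── index.html                      # https://xxx.xxx.com/
├── wp-content/
│   └── ...                         # css / js / images
└── 2024/
    └── 04/
        └── 28/
            └── some-post/
                └── index.html      # https://xxx.xxx.com/2024/04/28/some-post/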
Pushing to a GitHub Repository
1. Create the repository. Log in to GitHub, create a repository, and enable automatic Pages deployment: Settings -> Code and automation -> Pages -> Build and deployment -> select the relevant branch under Branch. Reference: https://docs.github.com/en/pages/quickstart PS: if you want it to serve as your account's home page, name the repository "username.github.io".
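As a side note, the GitHub CLI can create the repository from the command line as well; a sketch, using the same placeholder repository name as below:

gh repo create xxxxxx.github.io --public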
2. Commit with Git. Push the crawled files to GitHub:
[root@iZbp1605iwejf5qgem2c7hZ ~]# cd blog
[root@iZbp1605iwejf5qgem2c7hZ blog]# git init
[root@iZbp1605iwejf5qgem2c7hZ blog]# git add .
[root@iZbp1605iwejf5qgem2c7hZ blog]# git commit -m init
[root@iZbp1605iwejf5qgem2c7hZ blog]# git remote add origin https://github.com/xxxxxx/xxxxxx.github.io.git
[root@iZbp1605iwejf5qgem2c7hZ blog]# git checkout -b main
[root@iZbp1605iwejf5qgem2c7hZ blog]# git push -u origin main
Forwarding to the Static Site via Ingress
1. Configure the Ingress to forward to the GitHub backup pages when the service is unavailable:
# Ingress snippets configuration:
nginx.org/server-snippets: |
  error_page 502 = @fallback;
  location @fallback {
      proxy_pass https://xxxxxx.github.io;
  }
Be careful: a misconfigured snippet can break networking for the entire cluster! The domain it references must be resolvable at all times, so pointing it at a k8s service name is not recommended. Reference: notes on a cluster network outage caused by an Nginx-Ingress-Controller snippets configuration.
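For orientation, a minimal sketch of where the annotation sits on an Ingress resource; the resource name, host, and backend service are hypothetical:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blog-ingress                # hypothetical
  annotations:
    nginx.org/server-snippets: |
      error_page 502 = @fallback;
      location @fallback {
          proxy_pass https://xxxxxx.github.io;
      }
spec:
  rules:
    - host: xxx.xxx.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: blog-service  # hypothetical
                port:
                  number: 80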
2. Run the Nginx Ingress Controller on the control-plane node, so that a failing worker node cannot take it down with it.
2.1 Check the node taints:
[root@iZbp1605iwejf5qgem2c7hZ blog]# kubectl describe node | grep role -n
8: node-role.kubernetes.io/control-plane=
18:Taints: node-role.kubernetes.io/control-plane:NoSchedule
2.2 Add a toleration for that taint to the Nginx-Ingress-Controller Deployment:
# Goes under spec.template.spec
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Equal"
    value: ""
    effect: "NoSchedule"
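After applying the change, you can check that the controller pod was scheduled onto the control-plane node; the namespace depends on how the controller was installed, nginx-ingress is an assumption here:

kubectl get pods -n nginx-ingress -o wide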
Done! When the worker node is lost, the Nginx Ingress Controller reports a 502 gateway error, which triggers the 502 fallback and returns the matching file on GitHub to the browser. Possibly due to GitHub rate limiting, pages load somewhat slowly: a single file is fine, but a page typically pulls in dozens of files, and loading only speeds up once the browser has cached some of them. This was later fully automated, with the static site deployed onto the control-plane node itself, which is very fast; see the follow-up article: Fully Automated Static Site Publishing with GitHub Actions.