Detecting a web page's encoding in Python
There is a kind of thirst that only wine can quench, and that thirst is loneliness.
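Finding data based on the encoding the page returns
Say I want the title of a page. Matching the regex <title>(.*?)</title> against the response should be enough, but quite often requests cannot decode the page correctly, typically because the server's Content-Type header names no charset and requests falls back to ISO-8859-1 for text responses. The decoded text is then garbled and no usable data comes back.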
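To make the failure concrete, here is a small offline sketch. It assumes a page whose bytes are GBK but which gets decoded as ISO-8859-1. The sample HTML and the variable names are made up for illustration.

```python
import re

# Hypothetical page body: GBK-encoded bytes, as a Chinese site might send them
raw = "<title>中文标题</title>".encode("gbk")

# What you get when the bytes are decoded with the ISO-8859-1 fallback
wrong = raw.decode("iso-8859-1")
# What you get when the real encoding is used
right = raw.decode("gbk")

pattern = re.compile(r"<title>(.*?)</title>", re.S)
wrong_title = pattern.search(wrong).group(1)
right_title = pattern.search(right).group(1)

print(repr(wrong_title))  # mojibake: one Latin-1 character per byte
print(right_title)        # the real title
```

The regex matches in both cases because the ASCII tags survive the wrong decoding; only the title text is damaged, which is why the problem is easy to overlook.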
The fix:
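import re
import requests

# url (a host name) and headers (a dict) are defined elsewhere in the script
r_port_top = requests.get(url="http://" + url, headers=headers, timeout=5)
if r_port_top.encoding == "ISO-8859-1":
    # ISO-8859-1 is what requests reports when the header names no charset,
    # so look for a charset declared inside the HTML itself
    encodings = requests.utils.get_encodings_from_content(r_port_top.text)
    if encodings:
        encoding = encodings[0]
    else:
        # Nothing declared in the page: guess from the raw bytes
        encoding = r_port_top.apparent_encoding or "utf-8"
    # Decode the raw bytes with the chosen encoding, replacing bad bytes
    encode_content = r_port_top.content.decode(encoding, "replace")
    port_title = re.search("<title>(.*?)</title>", encode_content, re.S).group(1)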
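For readers curious what the get_encodings_from_content step is looking for, the following is a rough stdlib-only sketch of the same idea: scan the HTML for a charset declared in a meta tag or an XML prolog. The function name declared_charsets and both regexes are illustrative assumptions, not the actual requests implementation.

```python
import re

# Illustrative patterns (assumptions, not the regexes requests uses internally)
META_CHARSET_RE = re.compile(r'<meta[^>]*?charset\s*=\s*["\']?([\w.:-]+)', re.I)
XML_ENCODING_RE = re.compile(r'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([\w.:-]+)', re.I)


def declared_charsets(html):
    """Return the charset names declared inside the document, in order found."""
    found = XML_ENCODING_RE.findall(html) + META_CHARSET_RE.findall(html)
    # Normalise case and drop duplicates while keeping the original order
    seen = []
    for name in found:
        name = name.lower()
        if name not in seen:
            seen.append(name)
    return seen


html5 = '<html><head><meta charset="GBK"><title>x</title></head></html>'
html4 = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
print(declared_charsets(html5))
print(declared_charsets(html4))
print(declared_charsets("<title>no declaration</title>"))
```

An empty list is the case where the code has to fall back to guessing from the bytes, which is what apparent_encoding does.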
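Checking the reported encoding first and then converting does work, but only for the ISO-8859-1 case: when the page is reported as utf-8 or anything else, the snippet never sets port_title. So here is the "ultimate" version.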
import random
import re
import requests

# url is a host name and headerss is a list of User-Agent strings, both defined elsewhere
try:
    UA = random.choice(headerss)
    headers = {"User-Agent": UA}
    r_port_top = requests.get(url="http://" + url, headers=headers, timeout=5)
    # Reported encodings that get re-checked against the page itself
    if (r_port_top.encoding or "").upper() in ("ISO-8859-1", "GB2312", "GBK"):
        # Prefer the charset declared in the HTML, else guess from the bytes
        encodings = requests.utils.get_encodings_from_content(r_port_top.text)
        if encodings:
            encoding = encodings[0]
        else:
            encoding = r_port_top.apparent_encoding or "utf-8"
        encode_content = r_port_top.content.decode(encoding, "replace")
    else:
        # Any other encoding: use the text as requests decoded it
        encode_content = r_port_top.text
    port_title = re.search("<title>(.*?)</title>", encode_content, re.S).group(1)
except Exception:
    try:
        # Last resort: match on the raw bytes and decode leniently
        raw_title = re.search(b"<title>(.*?)</title>", r_port_top.content, re.S).group(1)
        port_title = raw_title.decode("utf-8", "replace")
    except Exception:
        # The request itself failed or the page has no title
        port_title = "Unable to retrieve the site title for now"
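The decoding step can also be pulled into one helper that tries several candidate encodings in turn. Below is a minimal sketch of that idea, not the author's code: decode_page and its parameters are hypothetical names. It additionally assumes that GB2312 and GBK labels can safely be widened to GB18030, a superset of both, which helps with pages that declare gb2312 but contain characters outside it.

```python
# Labels that are widened to gb18030, a superset of GBK and GB2312
GB_FAMILY = {"gb2312", "gbk", "cp936"}


def decode_page(raw, candidates, last_resort="utf-8"):
    """Decode raw bytes with the first candidate encoding that fits.

    raw        -- response body as bytes (e.g. response.content)
    candidates -- encoding names to try in order; None entries are skipped
    """
    for name in candidates:
        if not name:
            continue
        name = name.strip().lower()
        if name in GB_FAMILY:
            name = "gb18030"
        if name == "iso-8859-1":
            # Latin-1 accepts every byte, so it can never be ruled out;
            # skip it and let the stricter candidates decide
            continue
        try:
            return raw.decode(name)  # strict: raises on a bad guess
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Nothing fit: decode leniently rather than fail
    return raw.decode(last_resort, "replace")


# Traditional characters that GBK can encode, on a page labelled gb2312
body = "<title>網頁</title>".encode("gbk")
print(decode_page(body, ["ISO-8859-1", "gb2312", "utf-8"]))
```

With requests, the candidates might be the charset declared in the page, then response.encoding, then response.apparent_encoding, in that order.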
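Detecting and converting directly with chardet
The method above is honestly pretty clumsy. chardet solves the page-encoding problem with far less effort.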
# -*- coding: utf-8 -*-
# @Time : 2018/5/4 0004 8:55
# @Author : Langzi
# @Blog : www.langzi.fun
# @File : get urls.py
# @Software: PyCharm
import chardet
import re
import requests

url = "https://stackoverflow.com"
d1 = requests.get(url)
# d1.content is the raw response body as bytes, before any decoding
print(d1.content)
# chardet guesses the encoding from the bytes alone
codesty = chardet.detect(d1.content)
# chardet reports None when it cannot guess, so fall back to utf-8
a = d1.content.decode(codesty["encoding"] or "utf-8", "replace")
The resulting a is the page content decoded into text, so at this point re.search("<title>(.*?)</title>", a) is enough to match the title of any site.
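Putting the last step together, the sketch below wraps the chardet call and the title regex into two small functions. detect_and_decode and extract_title are hypothetical helper names, and the confidence threshold of 0.5 and the UTF-8 fallback are arbitrary assumptions. chardet is imported lazily so that the example still runs, using only the fallback, where the package is not installed.

```python
import html
import re

# Case-insensitive, and tolerant of attributes on the title tag
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)


def detect_and_decode(raw, min_confidence=0.5, fallback="utf-8"):
    """Decode bytes using chardet's guess, or the fallback if the guess is weak."""
    try:
        import chardet  # third-party: pip install chardet
        guess = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
    except ImportError:
        guess = {"encoding": None, "confidence": 0.0}
    encoding = guess.get("encoding")
    if not encoding or (guess.get("confidence") or 0.0) < min_confidence:
        encoding = fallback
    try:
        return raw.decode(encoding, "replace")
    except LookupError:
        # chardet named a codec this Python build does not know
        return raw.decode(fallback, "replace")


def extract_title(text):
    """Return the page title with entities unescaped, or None if absent."""
    match = TITLE_RE.search(text)
    if match is None:
        return None
    # Collapse runs of whitespace, which often include newlines
    return " ".join(html.unescape(match.group(1)).split())


page = b"<html><head><TITLE>\n  Stack Overflow &amp; friends </TITLE></head></html>"
print(extract_title(detect_and_decode(page)))  # Stack Overflow & friends
```

Returning None for a missing title, rather than raising, lets the caller decide what placeholder text to show.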