Detecting a Web Page's Encoding with Python

Knowledge 07-22

There is a kind of thirst that only wine can quench, and that thirst is loneliness.

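Finding data based on the encoding the page returns

Say I want the title of a web page. Matching the regular expression <title>(.*?)</title> against the page is normally enough. Very often, though, requests cannot decode the page correctly because of an encoding problem, so the match returns nothing usable.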
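To make the failure concrete, here is a minimal offline sketch. The sample HTML and variable names are made up for illustration. It assumes a page whose bytes are GBK while the client decodes them as ISO-8859-1, the default requests applies to text responses whose headers name no charset.

```python
import re

# Hypothetical page: the server sends GBK bytes but names no charset in its headers.
raw = "<html><head><title>网页标题</title></head></html>".encode("gbk")

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S)

# ISO-8859-1 maps every byte to a character, so the wrong codec never raises;
# it silently produces mojibake instead.
wrong = TITLE_RE.search(raw.decode("iso-8859-1")).group(1)
# Decoding with the codec the page was really written in recovers the title.
right = TITLE_RE.search(raw.decode("gbk")).group(1)

print(repr(wrong))
print(right)
```

The regex matches in both cases because the ASCII tags survive; only the text between them is damaged.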

Solution:

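import re
import requests

# url (a bare host name) and headers are assumed to be defined earlier.
r_port_top = requests.get(url="http://" + url, headers=headers, timeout=5)
# requests reports ISO-8859-1 for text responses whose headers name no charset.
if r_port_top.encoding == "ISO-8859-1":
    # Prefer the charset the page declares in its own markup.
    encodings = requests.utils.get_encodings_from_content(r_port_top.text)
    if encodings:
        encoding = encodings[0]
    else:
        # Otherwise fall back to the encoding guessed from the raw bytes.
        encoding = r_port_top.apparent_encoding
    # Decode the raw bytes to text; undecodable bytes are replaced, not fatal.
    encode_content = r_port_top.content.decode(encoding, "replace")
    port_title = re.search("<title>(.*?)</title>", encode_content, re.S).group(1)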

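This approach first works out the page's encoding and then converts the content. It only handles the ISO-8859-1 case, though, so it is no help when the page is reported as, say, utf-8. Next comes the ultimate version.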

import random
import re
import requests

TITLE_RE = re.compile("<title>(.*?)</title>", re.S)

try:
    # headerss is assumed to be a list of User-Agent strings defined earlier.
    UA = random.choice(headerss)
    headers = {"User-Agent": UA}
    r_port_top = requests.get(url="http://" + url, headers=headers, timeout=5)
    # Compare in lower case so GBK/gbk and GB2312/gb2312 share one branch.
    if (r_port_top.encoding or "").lower() in ("iso-8859-1", "gb2312", "gbk"):
        encodings = requests.utils.get_encodings_from_content(r_port_top.text)
        if encodings:
            encoding = encodings[0]
        else:
            encoding = r_port_top.apparent_encoding
        encode_content = r_port_top.content.decode(encoding, "replace")
    else:
        # Any other reported encoding (for example utf-8): let requests decode.
        encode_content = r_port_top.text
    port_title = TITLE_RE.search(encode_content).group(1)
except Exception:
    try:
        # Last resort: search the raw bytes and decode only the title.
        raw_title = re.search(b"<title>(.*?)</title>", r_port_top.content, re.S).group(1)
        port_title = raw_title.decode("utf-8", "replace")
    except Exception:
        port_title = "Unable to get the site title for now"
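The same decode-then-search logic can also be packaged as a function that works on raw bytes, which makes it easy to try without a network call. This is only a sketch under assumptions: `extract_title`, `META_CHARSET_RE`, `DEFAULT_TITLE` and the sample bytes are hypothetical names introduced here, and the small regex is a simplified stand-in for `requests.utils.get_encodings_from_content`, not the pattern requests itself uses.

```python
import re
from typing import Optional

# Simplified stand-in for the <meta ... charset=...> lookup (an assumption).
META_CHARSET_RE = re.compile(rb"<meta[^>]*?charset=[\"']?([\w-]+)", re.I)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S | re.I)
DEFAULT_TITLE = "Unable to get the site title for now"


def extract_title(raw: bytes, header_encoding: Optional[str]) -> str:
    """Decode raw page bytes and return the <title> text, or a default."""
    # Prefer the charset the page declares about itself, then the header value.
    declared = META_CHARSET_RE.search(raw)
    encoding = declared.group(1).decode("ascii") if declared else header_encoding
    try:
        text = raw.decode(encoding or "utf-8", "replace")
    except LookupError:
        # Unknown codec name: fall back to utf-8 rather than failing.
        text = raw.decode("utf-8", "replace")
    match = TITLE_RE.search(text)
    return match.group(1).strip() if match else DEFAULT_TITLE


page = '<meta charset="gbk"><title>网页标题</title>'.encode("gbk")
print(extract_title(page, "ISO-8859-1"))
print(extract_title(b"<p>no title here</p>", None))
```

With a real response, the two arguments would be the response's content and encoding attributes. Returning a default instead of raising keeps a batch scan over many hosts from stopping at the first odd page.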

Detecting and converting directly with chardet

The method above is really clumsy. With chardet, the page-encoding problem is solved easily.

# -*- coding: utf-8 -*-
# @Time : 2018/5/4 0004 8:55
# @Author : Langzi
# @Blog : www.langzi.fun
# @File : get urls.py
# @Software: PyCharm
import chardet
import re
import requests

url = "https://stackoverflow.com"
d1 = requests.get(url)
print(d1.content)
if isinstance(d1.content, str):
    # Already text: nothing to decode.
    a = d1.content
else:
    # Guess the encoding from the raw bytes, then decode with it.
    codesty = chardet.detect(d1.content)
    # chardet reports None when it cannot decide; fall back to utf-8.
    a = d1.content.decode(codesty["encoding"] or "utf-8")

The resulting a is the page content decoded to text. At this point a plain re.search("<title>(.*?)</title>", a) is enough to match the title of any site.
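The detector does not always reach a verdict: when it cannot decide, the `encoding` entry of its result is `None`, and passing that to `decode` raises. A defensive variant might look like the sketch below. `decode_with_guess`, its `fallback` parameter and the hand-written result dicts are assumptions for illustration; the dicts only mimic the `encoding` and `confidence` keys that `chardet.detect` returns, so the example runs without chardet installed.

```python
from typing import Optional


def decode_with_guess(raw: bytes, guess: dict, fallback: str = "utf-8") -> str:
    """Decode raw bytes using a chardet-style result dict."""
    encoding: Optional[str] = guess.get("encoding")
    try:
        # "replace" keeps going past bytes the guessed codec cannot decode.
        return raw.decode(encoding or fallback, "replace")
    except LookupError:
        # The detector named a codec this Python build does not know.
        return raw.decode(fallback, "replace")


raw = "<title>网页标题</title>".encode("gb2312")
print(decode_with_guess(raw, {"encoding": "GB2312", "confidence": 0.99}))
print(decode_with_guess(b"<title>plain</title>", {"encoding": None, "confidence": 0.0}))
```

With a real response this would be called as `decode_with_guess(d1.content, chardet.detect(d1.content))`.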
