How to Scrape HTML Web Pages with Web Scraping

This article explains how to scrape HTML web pages with web scraping. The explanation is short and easy to follow, so work through it step by step.


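There are several ways to collect data from the web:

- Scrape HTML web pages

- Download data files directly, for example csv, txt, or pdf files

- Access data through an application programming interface (API), for example a movie database or Twitter

If you choose to scrape web pages, you first need to understand the basic structure of an HTML page.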

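Basic structure of HTML

HTML tags: head, body, p, a, form, table, and so on.

Tags have attributes. For example, the a tag has the attribute href, which holds the target of the link.

class and id are special attributes that HTML uses, together with Cascading Style Sheets (CSS), to control the style of each element. id is a unique identifier for an element, while class groups elements together for styling.

An element can be associated with multiple classes, separated by spaces, for example <h3 class="city main">London</h3>

In the example from W3Schools, the city class sets three style properties and the main class sets one. The London heading uses both classes, city and main, so it is rendered with both sets of styles.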
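A minimal sketch of how these attributes look once a page is parsed with BeautifulSoup (introduced later in this article). The snippet string and the example.com URL are made up for illustration; only the h3 element mirrors the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment: one element with two classes, one link with href.
snippet = (
    '<h3 class="city main">London</h3>'
    '<a href="https://example.com/london">More about London</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

heading = soup.find("h3")
link = soup.find("a")

# class can hold several values, so BeautifulSoup returns it as a list.
print(heading["class"])

# href holds a single value, so it comes back as a plain string.
print(link["href"])

# Searching by one class finds the element even though it has two classes.
main_elements = soup.find_all(class_="main")
print(main_elements)
```

Because class is multi-valued, searching for either city or main matches the same heading.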

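Tags can be referred to by their position relative to one another:

child: a tag inside another tag. For example, the two p tags are children of the div tag.

parent: a tag that contains another tag. For example, the html tag is the parent of the body tag.

siblings: tags that share the same parent. For example, in the HTML example the head and body tags are siblings because both are inside html. The two p tags are siblings as well, because they share the same parent tag.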
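The three relationships can be checked directly in BeautifulSoup. This is a small sketch on a made-up document (html_doc is hypothetical and written without whitespace between tags so that every child is a tag):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document: html > (head, body), body > div > (p, p).
html_doc = (
    "<html><head><title>Demo</title></head>"
    "<body><div><p>First</p><p>Second</p></div></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

div = soup.find("div")
first_p, second_p = div.find_all("p")

# child: the two p tags are children of div
child_names = [child.name for child in div.children if isinstance(child, Tag)]
print(child_names)

# parent: html is the parent of body
parent_of_body = soup.body.parent.name
print(parent_of_body)

# siblings: head and body share the parent html; the p tags share the parent div
sibling_of_head = soup.head.find_next_sibling().name
sibling_of_first_p = first_p.find_next_sibling("p")
print(sibling_of_head, sibling_of_first_p)
```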

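Four steps to scrape a web page:

Step 1: Install the modules

Install requests and beautifulsoup4, which are used to fetch and parse web pages. Other options include scrapy, selenium, and more.

- requests: lets you send HTTP/1.1 requests using Python. To install it, open a terminal (Mac) or the Anaconda Command Prompt (Windows) and run, for example, pip install requests

- BeautifulSoup: a web page parsing library. To install it, run, for example, pip install beautifulsoup4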

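Step 2: Use the installed packages to read the page source

Step 3: Browse the page source to find where the information you need is located

Browsers differ in how you open the page source. A few are listed here:

- Firefox: right-click on the web page and select "View Page Source"

- Safari: follow the browser's instructions for viewing the page source

- Internet Explorer: follow the browser's instructions for viewing the page source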

Step 4: Start extracting

Libraries you can use:

- BeautifulSoup: simple; supports CSS selectors but not XPath

- scrapy: supports CSS selectors and XPath

- Selenium: can scrape dynamic pages (for example, pages that keep loading content as you scroll down)

- lxml, and others

Objects in BeautifulSoup:

- Tag: an XML or HTML tag

- Name: every tag has a name

- Attributes: a tag may have any number of attributes. The attributes of a tag are shown as a dictionary in the form {attribute1_name: attribute1_value, attribute2_name: attribute2_value, ...}. If an attribute has multiple values, the value is stored as a list

- NavigableString: the text within a tag

Here is the code:
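The objects listed above can be seen on a small fragment. This is a sketch only: the snippet string and its id and class values are invented for illustration, and select_one is used to show the CSS selector support mentioned for BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment with a multi-valued class attribute and nested text.
snippet = '<div id="content"><p class="intro lead" id="first">Hello, <b>world</b></p></div>'
soup = BeautifulSoup(snippet, "html.parser")

# CSS selector: the p with class "intro" inside the div with id "content"
p = soup.select_one("div#content p.intro")

print(isinstance(p, Tag))   # Tag: the parsed <p> element
print(p.name)               # Name: the tag's name
print(p.attrs)              # Attributes: a dict; class is stored as a list

# NavigableString: a piece of text directly inside the tag
first_piece = p.contents[0]
print(isinstance(first_piece, NavigableString), repr(first_piece))

# get_text() joins all text inside the tag, including nested tags
print(p.get_text())
```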

# Import the requests and BeautifulSoup packages

# In a Jupyter notebook, display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# import the requests package
import requests

# import BeautifulSoup from the package bs4 (i.e. beautifulsoup4)
from bs4 import BeautifulSoup

# Get web page content

# send a GET request to the web page
# (replace the placeholder string with the URL of the page to scrape)
page = requests.get("A simple example page")

# status_code 200 indicates success;
# status codes of 400 and above indicate an error
if page.status_code == 200:
    # the content property gives the response body as bytes
    print(page.content)  # body as bytes
    print(page.text)     # body decoded to a unicode string
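As an alternative to comparing status_code by hand, requests can raise an exception for error responses. The helpers below are a hypothetical sketch, not part of the original code: fetch_html, html_or_raise, and the 10-second timeout are names and values chosen here for illustration.

```python
import requests


def html_or_raise(response):
    """Return the decoded body of a response, raising on 4xx/5xx status codes."""
    # raise_for_status() raises requests.HTTPError for status codes 400-599
    response.raise_for_status()
    return response.text


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a str (hypothetical helper)."""
    # The timeout keeps the call from waiting indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    return html_or_raise(response)
```

A call such as fetch_html("https://example.com") would then either return the page source or raise requests.HTTPError, so the parsing code that follows never runs on an error page.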

# Parse web page content

# Process the returned content using the beautifulsoup module:
# create a BeautifulSoup object from the HTML source and Python's html.parser
soup = BeautifulSoup(page.content, "html.parser")

# the soup object stands for the root node of the HTML document tree
print("Soup object:")

# print the soup object nicely, with indentation
print(soup.prettify())

# soup.children returns an iterator over the direct children of the root node
print("\nsoup children nodes:")
soup_children = soup.children
print(soup_children)

# convert the iterator to a list
soup_children = list(soup.children)
print("\nlist of children of root:")
print(len(soup_children))

# html is the only child of the root node
html = soup_children[0]
html

# Get head and body tag
html_children = list(html.children)
print("how many children under html? ", len(html_children))

for idx, child in enumerate(html_children):
    print("Child {} is: {}\n".format(idx, child))

# head is the second child of html
head = html_children[1]

# extract all text inside head
print("\nhead text:")
print(head.get_text())

# body is the fourth child of html
body = html_children[3]
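The indexes 1 and 3 above only make sense if the line breaks between tags are counted as children too. The sketch below assumes a page laid out with one newline between tags (html_doc is a made-up stand-in, since the original page is not reproduced here) and shows two ways to reach head and body without counting positions.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical page source with a newline between the top-level tags.
html_doc = """<html>
<head><title>A simple example page</title></head>
<body>
<div><p>First paragraph.</p></div>
</body>
</html>"""
soup = BeautifulSoup(html_doc, "html.parser")
html = soup.html

# Each newline between tags becomes a NavigableString child of html.
kinds = ["text" if isinstance(c, NavigableString) else c.name for c in html.children]
print(kinds)

# Position-independent access: by tag name ...
head = soup.head
body = soup.find("body")

# ... or by listing only the direct children that are tags.
tag_children = html.find_all(True, recursive=False)
print([tag.name for tag in tag_children])
```

Accessing tags by name keeps working if the whitespace in the page source changes, whereas fixed indexes would shift.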

# Get details of a tag

# get the first p tag in the div of body
div = list(body.children)[1]
p = list(div.children)[1]

# get the details of the p tag
# first, get the data type of p
print("\ndata type:")
print(type(p))

# get the tag name (a property of the p object)
print("\ntag name: ")
print(p.name)

# a tag object stores its attributes in a dictionary;
# use <tag>.attrs to get the dictionary,
# in which each attribute name of the tag is a key

# get all attributes
p.attrs

# get the "class" attribute
print("\ntag class: ")
print(p["class"])

# how to determine if 'id' is an attribute of p?

# get the text of the p tag
p.get_text()
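One possible answer to the question left in the code, how to tell whether id is an attribute of a tag, sketched on a made-up fragment (the class and id values are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first p has an id, the second does not.
snippet = (
    '<div><p class="inner-text first-item" id="first">First paragraph.</p>'
    '<p class="inner-text">Second paragraph.</p></div>'
)
soup = BeautifulSoup(snippet, "html.parser")
first_p, second_p = soup.find_all("p")

# Option 1: has_attr() answers the question directly.
print(first_p.has_attr("id"), second_p.has_attr("id"))

# Option 2: attrs is a dictionary, so a membership test works as well.
print("id" in first_p.attrs, "id" in second_p.attrs)

# Option 3: get() returns None (or a chosen default) when the attribute is missing,
# whereas second_p["id"] would raise KeyError.
print(first_p.get("id"), second_p.get("id"), second_p.get("id", "no id"))
```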

That covers how to scrape HTML web pages with web scraping. The details are best confirmed by trying the steps on real pages yourself.
