2024: Master Python Web Scraping in One Day [Basics], covering requests, beautifulsoup, selenium:
https://www.bilibili.com/video/BV1Ju4y1Y7k6/
Let's scrape every post listed on the https://www.cnblogs.com/ home page, including each post's title, URL, and author.
First fetch the page with requests, then parse it with bs4.
Reference code:
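import requests
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/"
r = requests.get(url)
# Set the encoding of the response object
r.encoding = "utf-8"
# print(r.text)

# Parse the HTML with the lxml parser
soup = BeautifulSoup(r.text, 'lxml')
# Select every <article> element with class "post-item" (one per post)
article_list = soup.select("article.post-item")
# print(article_list)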
for article in article_list:
    print("==========")
    # Author link
    author = article.find("a", class_="post-item-author")
    print(author.get_text())
    # Title link: its text is the post title, its href is the post URL
    link = article.find("a", class_="post-item-title")
    print(link.get_text())
    print(link.attrs["href"])
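
The listing above prints each field as it goes. If you would rather collect the results for later use, the parsing step can be wrapped in a function, roughly as sketched below. This is only an illustration under a few assumptions: the function name parse_posts and the SAMPLE_HTML markup are made up for the example (the sample just mimics the class names used above, it is not the real page), and the built-in "html.parser" is used so the sketch does not depend on lxml. It also skips entries without a title link and tolerates a missing author, which the listing above does not do.

```python
from bs4 import BeautifulSoup


def parse_posts(html):
    """Return a list of dicts (title, link, author), one per post found in html."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for article in soup.select("article.post-item"):
        link = article.find("a", class_="post-item-title")
        author = article.find("a", class_="post-item-author")
        if link is None:
            continue  # skip entries that have no title link
        posts.append({
            "title": link.get_text(strip=True),
            "link": link.get("href"),
            # find() returns None when the author link is absent
            "author": author.get_text(strip=True) if author else None,
        })
    return posts


# Hypothetical sample markup that only mimics the class names used above
SAMPLE_HTML = """
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/example/p/1.html">First post</a>
  <a class="post-item-author" href="#"><span>alice</span></a>
</article>
<article class="post-item">
  <a class="post-item-title" href="https://www.cnblogs.com/example/p/2.html">Second post</a>
</article>
"""

if __name__ == "__main__":
    for post in parse_posts(SAMPLE_HTML):
        print(post)
```

Against the live page, the same function would presumably be called as parse_posts(r.text), with r being the response object from the reference code.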