HTML syntax

<tag attribute="value" attribute="value">marked-up content</tag>

What is bs4

bs4 is short for beautifulsoup4. It parses web pages and extracts data from them, using its own lookup syntax.
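As a minimal sketch of how the HTML pieces (tag, attributes, marked-up content) appear once bs4 has parsed them, the snippet below reads them back from a single tag. The HTML string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, written by hand for this example
html = '<a href="/page/1.html" title="demo">first page</a>'

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

tag_name = tag.name        # the tag itself: "a"
tag_attrs = tag.attrs      # the attributes, as a dict
tag_content = tag.text     # the marked-up content
print(tag_name, tag_attrs, tag_content)
```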

Installing bs4

pip install bs4

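Methods for finding data with bs4

1. find(tag, attribute=value) finds a single match.

Example: find("table", id="3") finds one element whose id is 3, that is, the HTML element <table id="3">xxxxx</table>.

2. find_all(tag, attribute=value) is used the same way as find, but returns every match.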
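The difference between the two methods can be sketched on a small hand-written page. The HTML below is hypothetical; only find, find_all, and the id="3" lookup come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables, written by hand for this example
html = """
<table id="3"><tr><td>apple</td></tr></table>
<table id="4"><tr><td>pear</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("table", id="3")      # a single Tag, or None if nothing matches
tables = soup.find_all("table")         # a list-like collection of every match
missing = soup.find("table", id="99")   # no such table in this page

print(first.text.strip())
print(len(tables))
print(missing)
```

Because find returns None when nothing matches, chaining another call onto its result fails on pages that lack the element, so a None check is worth adding in real scrapers.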

Basic usage of bs4

Parsing data with bs4 takes three main steps.

1. Hand the page source to BeautifulSoup, which produces a bs object

page = BeautifulSoup(resp.text, "html.parser")

html.parser specifies which HTML parser to use. In effect it tells bs4 that the content being passed in is HTML.

2. Find data in the bs object

page.find("table",class_="hq_table")

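Because class is a Python keyword, it cannot be used directly as an argument name when you want to match the HTML class attribute. bs4 provides a way to tell the two apart: append "_" to class.

The same lookup can also be written another way that avoids the keyword clash: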

page.find("table",attrs={"class":"hq_table"})

3. Extract the data

Use .text to get the text content of an element.

For example: name = tds[0].text
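The three steps can be tried end to end on an inline string instead of a live response. This is only a sketch: the table contents are invented, and only the class name hq_table and the tds[0].text pattern come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; only the class name hq_table comes from the text above
html = """
<table class="hq_table">
  <tr><td>cabbage</td><td>1.20</td></tr>
  <tr><td>potato</td><td>0.85</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

# Step 1: build the bs object
page = BeautifulSoup(html, "html.parser")

# Step 2: both spellings select the same table
table = page.find("table", class_="hq_table")
same_table = table is page.find("table", attrs={"class": "hq_table"})

# Step 3: read the text of each cell
rows = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    rows.append((tds[0].text, tds[1].text))

print(same_table, rows)
```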

Example: scraping images from 优美图库 (umei.cc) with bs4

Approach

1. Fetch the source of the main page, then extract the href link of each child page
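View the page source in the browser and search for the keyword "黑白冷淡风欧美图片" (the title of one of the image sets). The keyword appears in the source, which shows that the page is rendered on the server.

2. Use the href to fetch the child page, then find the image download address in it
In the source above, the href points to the child page for that image.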
Viewing the source of the child page reveals the download address of the image (img -> src).

3. Download the image
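Before running against the live site, the extraction in steps 1 and 2 can be sketched offline. The two HTML strings below are hand-written stand-ins, and the paths and the example.com image URL are invented; only the class names taotu-main and big-pic and the base URL are taken from this walkthrough, and the real markup may differ.

```python
from bs4 import BeautifulSoup

BASE = "https://www.umei.cc"

# Hand-written stand-ins for the main page and one child page (assumed markup)
main_html = """
<div class="taotu-main">
  <a href="/weimeitupian/1.htm">one</a>
  <a href="/weimeitupian/2.htm">two</a>
</div>
"""
child_html = '<div class="big-pic"><img src="https://example.com/pics/abc.jpg"></div>'

# Step 1: collect the child page links from the main page
main_page = BeautifulSoup(main_html, "html.parser")
links = main_page.find("div", class_="taotu-main").find_all("a")
child_urls = [BASE + a.get("href") for a in links]

# Step 2: read the image address from the child page
child_page = BeautifulSoup(child_html, "html.parser")
src = child_page.find("div", class_="big-pic").find("img").get("src")

# The file name used in step 3 is whatever follows the last "/"
img_name = src.split("/")[-1]

print(child_urls)
print(src, img_name)
```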

Code

import os
import time

import requests
from bs4 import BeautifulSoup

url = "https://www.umei.cc/weimeitupian/"
resp = requests.get(url)
resp.encoding = 'utf-8'  # avoid garbled characters
# print(resp.text)

# Hand the page source to bs
main_page = BeautifulSoup(resp.text, "html.parser")
alist = main_page.find("div", class_="taotu-main").find_all("a")
# print(alist)

os.makedirs("img", exist_ok=True)  # open() below fails if the img/ folder is missing

for a in alist:
    href = a.get('href')  # get() returns the value of the href attribute
    child_href = "https://www.umei.cc" + href
    # print(child_href)

    # Fetch the source of the child page
    child_page_resp = requests.get(child_href)
    child_page_resp.encoding = 'utf-8'
    child_page_content = child_page_resp.text

    # Find the download path in the child page
    child_page = BeautifulSoup(child_page_content, "html.parser")
    big_pic = child_page.find("div", class_="big-pic")
    img = big_pic.find("img")
    src = img.get("src")
    # print(src)
    child_page_resp.close()

    # Download the image
    img_resp = requests.get(src)
    # img_resp.content holds the raw bytes
    img_name = src.split("/")[-1]  # everything after the last / in the URL
    with open("img/" + img_name, mode="wb") as f:
        f.write(img_resp.content)  # write the image bytes to the file
    print("over!", img_name)
    time.sleep(20)
    img_resp.close()

print("all over!")
resp.close()