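Tutorial: Scraping Web Pages with Selenium

More and more web pages load their content asynchronously or encrypt it, so scrapy cannot fetch them directly. Scraping such pages means emulating the browser's rendering, its JavaScript engine, and sometimes even mouse and keyboard events, which is where selenium comes in.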

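Installing selenium

selenium provides a standard browser automation stack that is supported by Chrome, Firefox, Safari, and Edge, and its WebDriver protocol is a W3C Recommendation.

In Python, selenium can be installed directly with pip: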

pip install selenium

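Installing ChromeDriver and Chrome

Before using selenium, download the WebDriver binary that matches your browser from the location listed by the selenium project, then place the driver in a directory on your PATH.

First download the Chrome .deb package from the official site and install the browser: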

sudo dpkg -i google-chrome-stable_current_amd64.deb

If the install fails because of missing dependencies, the following command fixes them:

sudo apt --fix-broken install

This single command installs the missing dependencies. Then test Chrome by taking a headless screenshot:

google-chrome-stable --no-sandbox --headless --disable-gpu --screenshot https://www.baidu.com

Then download chromedriver and place it in a directory on your PATH.
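
Before launching anything from Python, it can help to confirm that both binaries are actually visible on PATH. The following is a minimal sketch using only the standard library; the helper name find_executables is hypothetical, and the two binary names are the ones used in this tutorial.

```python
import shutil


def find_executables(names, path=None):
    """Map each executable name to its resolved location, or None if missing.

    `path` defaults to the current PATH environment variable.
    """
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    # Binary names as used in this tutorial; adjust them for other setups.
    found = find_executables(["google-chrome-stable", "chromedriver"])
    for name, location in found.items():
        print(f"{name}: {location or 'NOT FOUND on PATH'}")
```

If chromedriver is reported as missing, move the binary into one of the directories listed in PATH or extend PATH to include its directory.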

Starting Chrome in headless mode on Debian without an X environment:

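from selenium import webdriver
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # run without an X display
options.add_argument('--no-sandbox')   # typically required when running as root
options.add_argument('--disable-gpu')
# Selenium 4.10+ takes the driver log path through Service(log_output=...);
# older releases accepted webdriver.Chrome(..., service_log_path=...) instead.
service = Service(log_output='/tmp/chromedriver.log')
browser = webdriver.Chrome(options=options, service=service)
browser.get('https://www.baidu.com')
# ... interact with the page here ...
browser.quit()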
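
If the code between get() and quit() raises, quit() is never reached and the headless browser process is left running. One way to guard against that is a small context manager, sketched below. The name managed_browser and the factory argument are hypothetical; the factory is any zero-argument callable that returns a driver.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with `factory` and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        # Runs on normal exit and on exceptions, so no browser process leaks.
        browser.quit()
```

With the options object from the example above, usage would look like: with managed_browser(lambda: webdriver.Chrome(options=options)) as browser: browser.get('https://www.baidu.com').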

Installing geckodriver and Firefox

sudo apt-get install firefox-esr
wget https://github.com/mozilla/geckodriver/releases/download/v0.29.1/geckodriver-v0.29.1-linux64.tar.gz
tar zxf geckodriver-v0.29.1-linux64.tar.gz
# put the driver on PATH; /usr/local/bin is one common choice
sudo mv geckodriver /usr/local/bin/

Test Firefox:

firefox --screenshot a.png https://www.baidu.com

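In my testing, Firefox crashed in headless mode and was less stable than Chrome, so Chrome is the recommended choice.

Starting Firefox in headless mode on Debian without an X environment: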

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)
browser.get('https://www.baidu.com')
#...
browser.quit()
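
Since Chrome is the preferred browser here and Firefox the backup, a script can try them in that order and use the first one that starts. The following is a minimal sketch of that idea, not part of selenium itself; first_working_browser is a hypothetical helper, and each factory is a zero-argument callable that returns a driver.

```python
def first_working_browser(factories):
    """Return (name, browser) from the first factory that starts successfully.

    `factories` is an ordered list of (name, zero-argument callable) pairs.
    """
    errors = {}
    for name, factory in factories:
        try:
            return name, factory()
        except Exception as exc:  # e.g. driver missing or browser crashed on start
            errors[name] = exc
    raise RuntimeError(f"no browser could be started: {errors}")
```

A call might pass [("chrome", lambda: webdriver.Chrome(options=chrome_options)), ("firefox", lambda: webdriver.Firefox(options=firefox_options))], where the two options objects are built as in the headless examples.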

Basic Selenium usage

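from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `browser` is a driver created as in the headless examples above
browser.get('https://www.baidu.com')
# Wait up to 20 seconds; <body> is only a placeholder, wait for an element your page loads late
wait_func = EC.presence_of_element_located((By.TAG_NAME, 'body'))
WebDriverWait(browser, 20).until(wait_func)
# Scroll to the bottom to trigger lazy loading, then return the rendered DOM
js = "window.scrollTo(0, document.body.scrollHeight); return document.body.innerHTML"
html = browser.execute_script(js)  # rendered page
source = browser.page_source  # page source

# BeautifulSoup is recommended for parsing the HTML
soup = BeautifulSoup(html, 'html.parser')
o = soup.find('div')  # first <div>
o = soup.find(class_='top_div')  # first element with class "top_div"
if o is not None:
    print(o.text)

browser.quit()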
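
WebDriverWait.until accepts any callable that takes the driver and returns a truthy value once the page is ready, and a single scroll is often not enough for pages that keep loading content as you go down. The sketch below shows one possible wait function and a repeated-scroll helper. Both names (document_ready, scroll_until_stable) and the pause and max_rounds parameters are hypothetical; the code only relies on the driver's execute_script method.

```python
import time


def document_ready(driver):
    """Wait condition: True once the document has finished loading."""
    return driver.execute_script("return document.readyState") == "complete"


def scroll_until_stable(driver, pause=1.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give asynchronous content time to arrive
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            return rounds  # nothing new was loaded by the last scroll
        last_height = height
    return max_rounds
```

These would be used before quitting the browser, for example WebDriverWait(browser, 20).until(document_ready) followed by scroll_until_stable(browser), after which browser.page_source contains the content loaded so far. The max_rounds cap keeps pages with endless feeds from scrolling forever.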
