Getting Started with Web Scraping in Python, and Points to Watch Out For

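Scraping is a handy way to collect the data you need for tasks such as monitoring weather data, research, and information analysis. This article covers how to get started with scraping in Python, along with the precautions to keep in mind when doing it.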

Installing Python

First, install Python, which we need for scraping. This time I use `pyenv`, a tool for managing multiple Python versions. If you have already installed Python through Homebrew or a similar tool, I recommend uninstalling it first.

Installing pyenv

The environment set up in this article is as follows.

pyenv: v1.2.1
python: v3.6.0

1. Install pyenv with Homebrew. pipenv also seems popular and I am curious about it, but I will leave it out this time. I recommend adding the PATH settings to .bash_profile or a similar startup file.

$ brew install pyenv

$ export PATH="$HOME/.pyenv/bin:$PATH"
$ eval "$(pyenv init -)"

$ pyenv -h
Usage: pyenv <command> [<args>]

Some useful pyenv commands are:
commands List all available pyenv commands
local Set or show the local application-specific Python version
global Set or show the global Python version
shell Set or show the shell-specific Python version
install Install a Python version using python-build
uninstall Uninstall a specific Python version
rehash Rehash pyenv shims (run this after installing executables)
version Show the current Python version and its origin
versions List all Python versions available to pyenv
which Display the full path to an executable
whence List all Python versions that contain the given executable

See `pyenv help <command>' for information on a specific command.
For full documentation, see: https://github.com/pyenv/pyenv#readme

$ pyenv -v
pyenv 1.2.1

2. Once pyenv is installed, use the `install` command to install Python.

$ pyenv install 3.6.0
Installing python 3.6.0....

$ pyenv versions
* system
  3.6.0 (set by ...)

3. Switch the active Python version to 3.6.0.

$ pyenv versions
* system
  3.6.0 (set by ...)
      
$ python --version
Python 2.7.10
      
$ pyenv global 3.6.0
      
$ python --version
Python 3.6.0

Installing Beautiful Soup

Next, install the Python library used to work with the DOM data we fetch.

1. Install Beautiful Soup with pip3.

$ pip3 install beautifulsoup4

2. Check that Beautiful Soup was installed. If the import raises no error, the installation succeeded.

$ python3
Python 3.6.0 (default, Oct  8 2018, 21:45:07)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> exit()

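Fetching Page Data

Next, fetch a page with `urlopen` and use the Beautiful Soup library we just installed to read data from its DOM. This time we fetch page information from Yahoo! 天気 (Yahoo! Weather).

1. Create the scraping script.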

# scraping.py

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import URLError, HTTPError

try:
    # Fetch the page and parse it with Python's built-in HTML parser.
    html = urlopen("https://weather.yahoo.co.jp/weather")
    bsObj = BeautifulSoup(html, "html.parser")

    print(bsObj.html)
except HTTPError as e:
    # The server responded with an error status (e.g. 404 or 500).
    print(e)
except URLError as e:
    # The server could not be reached at all.
    print("The server could not be found.")
else:
    print("It worked.")

2. Check the result.

$ python3 scraping.py
<html lang="ja">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="text/css" http-equiv="Content-Style-Type"/>
<meta content="text/javascript" http-equiv="Content-Script-Type"/>
<meta content="天気予報,台風,地震,花粉,熱中症,豪雨,積雪" name="keywords"/>
<meta content="天気予報はもちろん、天気に関するあらゆる情報・災害情報を迅速にお伝えする天気・災害総合サイト。全国各地の雨雲の動きをリアルタイムにチェックできる「雨雲レーダー」や、花粉や熱中症、積雪情報など、季節ごとの天気情報も。" name="description"/>

...

</html>
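
The script above calls `urlopen` with its defaults, which means no timeout and Python's generic User-Agent string. A minimal sketch of a slightly more defensive fetch follows. The helper names `build_request` and `fetch_html`, the 10-second timeout, and the User-Agent value are assumptions made for illustration; they are not part of the original script.

```python
from urllib.request import Request, urlopen

# Hypothetical values for illustration; adjust them for your own project.
USER_AGENT = "my-scraper/0.1 (contact: you@example.com)"
TIMEOUT_SECONDS = 10


def build_request(url):
    """Build a request that identifies the scraper to the site operator."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url):
    """Return the raw response body, giving up after TIMEOUT_SECONDS."""
    with urlopen(build_request(url), timeout=TIMEOUT_SECONDS) as response:
        return response.read()
```

With these helpers, the `urlopen(...)` line in scraping.py could become `html = fetch_html("https://weather.yahoo.co.jp/weather")`, and the existing `HTTPError` and `URLError` handlers would still apply.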

Extracting Data from the Page

1. Add DOM access to scraping.py.

# scraping.py

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import URLError, HTTPError

try:
    html = urlopen("https://weather.yahoo.co.jp/weather")
    bsObj = BeautifulSoup(html, "html.parser")

    # print(bsObj.html)  # disabled
    print("[h1]", bsObj.h1.text)  # added: text of the first <h1>
    print("[h2]", bsObj.h2.text)  # added: text of the first <h2>
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server could not be found.")
else:
    print("It worked.")

2. Check the result.

$ python3 scraping.py
[h1] 全国の天気
[h2] 全国概況
It worked.

Running the script retrieves the headings as shown above. This time only a single page was targeted, but the same approach works for multiple pages.
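
One thing to watch when moving to several pages: `bsObj.h1` is `None` when a page has no `<h1>`, so `bsObj.h1.text` raises `AttributeError` on such pages. A minimal sketch of a more tolerant extraction step follows. The function name `extract_headings` and the inline HTML samples are hypothetical, chosen so the example runs without touching the network.

```python
from bs4 import BeautifulSoup


def extract_headings(html):
    """Return the text of the first <h1> and <h2>, or None when a tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    headings = {}
    for tag_name in ("h1", "h2"):
        tag = soup.find(tag_name)
        headings[tag_name] = tag.get_text(strip=True) if tag is not None else None
    return headings


# Hypothetical page bodies standing in for responses from several URLs.
pages = {
    "page-a": "<html><body><h1>Title A</h1><h2>Overview A</h2></body></html>",
    "page-b": "<html><body><h1>Title B</h1></body></html>",  # no <h2>
}

results = {name: extract_headings(html) for name, html in pages.items()}
for name, headings in results.items():
    print(name, headings)
```

In a real run, each value in `pages` would be the body returned by `urlopen` for one URL rather than a literal string.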

The sample code for this article is available here.

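Points to Watch Out For

There are several points to watch out for when scraping.
Some cases have even escalated into a criminal matter, such as the Librahack incident.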

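Depending on the implementation, a scraper can put a DoS-like load on the target server.
Check robots.txt, the terms of service, and similar rules.
Be careful about what the collected data is used for.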
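
The first two points, server load and robots.txt, can be sketched roughly as follows using only the standard library. The names `load_robots` and `crawl`, the user-agent string, and the one-second delay are assumptions for illustration, and the demo uses inline robots.txt rules and a stand-in fetch function so it runs offline. An appropriate delay depends on the site, and passing a robots.txt check does not replace reading the terms of service.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration.
USER_AGENT = "my-scraper"
DELAY_SECONDS = 1.0


def load_robots(robots_txt_lines):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt_lines)
    return parser


def crawl(urls, parser, fetch, delay=DELAY_SECONDS):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    results = {}
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt, so skip it
        results[url] = fetch(url)
        time.sleep(delay)  # spread requests out to keep server load low
    return results


# Demo with inline robots.txt rules and a stand-in fetch function (no network).
robots_lines = ["User-agent: *", "Disallow: /private/"]
parser = load_robots(robots_lines)
fetched = crawl(
    ["https://example.com/a", "https://example.com/private/b"],
    parser,
    fetch=lambda url: "body of " + url,
    delay=0,  # no pause in the demo; keep a real delay against live sites
)
print(list(fetched))
```

Against a live site, `RobotFileParser` can also download the file itself through `set_url()` followed by `read()`, and `fetch` would be a function that actually requests the page.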

For the finer details, this site should be a useful reference.

This article draws on the book introduced here.

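Summary

This article summarized how to get started with scraping and the points to watch out for. Scraping is a powerful technique for collecting and analyzing large amounts of data and gathering useful information efficiently. Use it responsibly and in moderation.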