[Python] How to Scrape Websites with BeautifulSoup to Extract Only the Data You Need

This post shows how to extract only the data you need from web pages using Python.

Extracting only the data you need from websites like this is called “scraping.”

"Scraping" = extracting only the parts of a website's data that you need and putting them to use

Keep this term in mind: searching for “Python scraping” turns up many helpful reference articles.

I tried it out right away, following this article:

Using BeautifulSoup on Google App Engine - Web Job Hunting Diary

It turns out to be easy with a Python library called BeautifulSoup. Amazing.

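#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written for Python 2 (urllib2, print statement) and BeautifulSoup 3.

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and decode it, replacing any bytes that are not valid UTF-8.
url = "http://www.yahoo.co.jp"
htmlfp = urllib2.urlopen(url)
try:
    html = htmlfp.read().decode("utf-8", "replace")
finally:
    htmlfp.close()

# Parse the HTML and print every <a> element.
soup = BeautifulSoup(html)
for link in soup.findAll("a"):
    print link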

This program fetches Yahoo! JAPAN (http://www.yahoo.co.jp/) and prints only its link elements, such as <a href="http://www.hogehoge.com/">~~~</a>.

The output looks like this (partial excerpt; each line here shows only the link's text, although the script prints the whole <a> tag):

ヘルプ
天気、交通情報ほか、連休お役立ち情報
「あいのり2」バングラデシュ編ついに完結
東日本大震災 チャリティーオークション
ショッピング
オークション
旅行、出張
ニュース
天気
スポーツ
ファイナンス
テレビ
地図
路線
グルメ
.
. (omitted)
.
会社概要
投資家情報
社会的責任
企業行動憲章
広告掲載について
採用情報
利用規約
セキュリティーの考え方
プライバシーポリシー
免責事項
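
For readers on Python 3, where urllib2 is not available, here is a minimal sketch of the same idea: keep only the <a> elements of a page. It is not the BeautifulSoup approach from the script above; it swaps in the standard library's html.parser so it runs without installing anything. The names LinkCollector, extract_links, and SAMPLE_HTML are hypothetical, and the sample page stands in for a live request to Yahoo! JAPAN.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._in_link = False  # True while we are inside an open <a>
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only text that appears inside a link is kept.
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_link = False


def extract_links(html):
    """Return a list of (href, text) tuples, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, used here instead of fetching a real site.
SAMPLE_HTML = """
<html><body>
  <p>Not a link</p>
  <a href="http://www.hogehoge.com/">Hoge</a>
  <div><a href="/help">Help <b>page</b></a></div>
</body></html>
"""

for href, text in extract_links(SAMPLE_HTML):
    print(href, text)
```

Returning the href and the link text separately makes it easy to print just the text, as in the excerpt above, or just the URLs, depending on which part of the data you need.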

That’s all from the Gemba.