Learning Machine Learning with Python: Multiple Regression | タカの技術ブログ

Multiple regression extends simple regression (one explanatory variable) to the case of several explanatory variables.

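The sample code builds the following model.
Target variable: house price per unit area
Explanatory variables: house age, distance to the nearest station, number of convenience stores within walking distance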
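
As a rough sketch of what this model means, multiple regression fits price ≈ b0 + b1·age + b2·distance + b3·stores by least squares. The example below uses invented numbers (not the UCI data) and NumPy's `np.linalg.lstsq`; the coefficient values and variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic illustration only: these values are invented, not taken from the dataset.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(0, 40, n)          # house age in years
distance = rng.uniform(50, 5000, n)  # distance to the nearest station in meters
stores = rng.integers(0, 11, n)      # convenience stores within walking distance

# Hypothetical "true" coefficients used to generate noise-free prices.
true_coef = np.array([45.0, -0.25, -0.005, 1.2])  # intercept, age, distance, stores
A = np.column_stack([np.ones(n), age, distance, stores])
price = A @ true_coef

# Least-squares fit of price = b0 + b1*age + b2*distance + b3*stores
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
print(coef)
```

Because the synthetic prices contain no noise, the fitted coefficients should match the hypothetical ones up to floating-point error. With real data the fit only approximates the underlying relationship.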

The experiment uses the following Taiwanese real estate dataset.

UCI Machine Learning Repository: Real estate valuation data set Data Set

https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set#

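The dataset was collected in Sindian District, New Taipei City, Taiwan. Real estate valuation is a regression problem.
The inputs are as follows.
X1 = transaction date (e.g. 2013.250 = March 2013)
X2 = house age (unit: years)
X3 = distance to the nearest MRT station (unit: meters)
X4 = number of convenience stores within walking distance (integer)
X5 = geographic coordinate, latitude (unit: degrees)
X6 = geographic coordinate, longitude (unit: degrees)
The output is as follows.
Y = house price per unit area (10,000 New Taiwan dollars per ping; the ping is a local unit, 1 ping = 3.3 m²)
Source: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set#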
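
Since Y is expressed in a local unit, a small conversion helper may make the numbers easier to read. The sketch below relies only on the unit definition quoted above (1 ping = 3.3 m², Y in units of 10,000 New Taiwan dollars); the function name `price_per_square_meter` and the sample value are hypothetical.

```python
PING_IN_SQUARE_METERS = 3.3  # 1 ping = 3.3 m^2, as stated in the dataset description
TWD_PER_UNIT = 10_000        # Y is expressed in units of 10,000 New Taiwan dollars


def price_per_square_meter(y):
    """Convert Y (10,000 TWD per ping) to TWD per square meter."""
    return y * TWD_PER_UNIT / PING_IN_SQUARE_METERS


# Hypothetical value of Y chosen for illustration, not a row from the dataset.
print(round(price_per_square_meter(33.0)))
```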

Setup

pip install pandas
pip install requests
pip install openpyxl
pip install scikit-learn

Loading, inspecting, and tidying the data

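import io

import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Download the Excel file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00477/Real%20estate%20valuation%20data%20set.xlsx'
res = requests.get(url)
res.raise_for_status()  # stop early if the download failed

# Wrap the downloaded bytes in a file-like object for read_excel
# (recent pandas versions no longer accept an encoding argument here)
data = pd.read_excel(io.BytesIO(res.content), sheet_name=0, header=0)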

data.columns = ['No', 'X1 transaction date', 'X2 house age', 'X3 distance to the nearest MRT station',
                'X4 number of convenience stores', 'X5 latitude', 'X6 longitude', 'Y house price of unit area']

# Inspect the shape and the first rows
print('Data shape: {}'.format(data.shape))
print(data.head())

# Keep only the columns used by the model
data = data[['X2 house age', 'X3 distance to the nearest MRT station',
             'X4 number of convenience stores', 'Y house price of unit area']]

# Check the data types
print(data.dtypes)

# Check the correlations
print(data.corr())
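
The full correlation matrix can be hard to scan, so one option is to pull out only the column for the target and sort it. The sketch below uses an invented DataFrame with hypothetical column names (`distance`, `stores`, `price`), not the downloaded data, to show the idea.

```python
import numpy as np
import pandas as pd

# Synthetic illustration only: invented values and hypothetical column names.
rng = np.random.default_rng(0)
n = 200
distance = rng.uniform(50, 5000, n)
df = pd.DataFrame({
    'distance': distance,
    'stores': rng.integers(0, 11, n),
    # price falls with distance; 'stores' is unrelated to price in this toy data
    'price': 45.0 - 0.005 * distance + rng.normal(0, 3, n),
})

# Pearson correlation of each explanatory variable with the target, sorted ascending
target_corr = df.corr()['price'].drop('price').sort_values()
print(target_corr)
```

Applied to the real data, the same pattern would use `'Y house price of unit area'` as the target column.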

Building and evaluating the model

# Target variable and explanatory variables
X = data.drop('Y house price of unit area', axis=1)
y = data['Y house price of unit area']

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Results: score() returns the coefficient of determination (R^2)
print('train:{:.3f}'.format(model.score(X_train, y_train)))
print('test:{:.3f}'.format(model.score(X_test, y_test)))
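
To make the printed scores easier to interpret, the sketch below shows that `LinearRegression.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot, and how to read the fitted intercept and coefficients. It runs on invented data standing in for the three explanatory variables, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three explanatory variables (invented values).
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.uniform(0, 40, n),     # house age
    rng.uniform(50, 5000, n),  # distance to the nearest station
    rng.integers(0, 11, n),    # number of convenience stores
])
# Hypothetical linear relationship plus Gaussian noise
y = 45.0 - 0.25 * X[:, 0] - 0.005 * X[:, 1] + 1.2 * X[:, 2] + rng.normal(0, 3, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# R^2 computed by hand on the test set
pred = model.predict(X_test)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('intercept:', model.intercept_)
print('coefficients:', model.coef_)
print('R^2 (manual): {:.3f}'.format(r2_manual))
print('R^2 (score) : {:.3f}'.format(model.score(X_test, y_test)))
```

Each coefficient is the change in predicted price for a one-unit change in that variable with the others held fixed, so coefficients on differently scaled variables (years versus meters) are not directly comparable in size.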