Boston House Price Prediction



First, load the dataset from sklearn

from sklearn.datasets import load_boston
import numpy as np

boston = load_boston()

boston.DESCR

Notes

Data Set Characteristics:

:Number of Instances: 506 

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

References

  • Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
  • Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
  • many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
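Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 (over ethical concerns about the B feature). If the import above fails on a newer version, the deprecation notice suggested reading the raw StatLib file directly; a minimal sketch that rebuilds the same X and y (assumes pandas and network access):

import numpy as np
import pandas as pd

# Each record in the raw file spans two physical lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Reassemble the 13 feature columns and the MEDV target
X = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]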
X = boston.data

y = boston.target

for i in range(5):
    print(X[i], '-->', y[i])
[  6.32000000e-03   1.80000000e+01   2.31000000e+00   0.00000000e+00
   5.38000000e-01   6.57500000e+00   6.52000000e+01   4.09000000e+00
   1.00000000e+00   2.96000000e+02   1.53000000e+01   3.96900000e+02
   4.98000000e+00] --> 24.0
[  2.73100000e-02   0.00000000e+00   7.07000000e+00   0.00000000e+00
   4.69000000e-01   6.42100000e+00   7.89000000e+01   4.96710000e+00
   2.00000000e+00   2.42000000e+02   1.78000000e+01   3.96900000e+02
   9.14000000e+00] --> 21.6
[  2.72900000e-02   0.00000000e+00   7.07000000e+00   0.00000000e+00
   4.69000000e-01   7.18500000e+00   6.11000000e+01   4.96710000e+00
   2.00000000e+00   2.42000000e+02   1.78000000e+01   3.92830000e+02
   4.03000000e+00] --> 34.7
[  3.23700000e-02   0.00000000e+00   2.18000000e+00   0.00000000e+00
   4.58000000e-01   6.99800000e+00   4.58000000e+01   6.06220000e+00
   3.00000000e+00   2.22000000e+02   1.87000000e+01   3.94630000e+02
   2.94000000e+00] --> 33.4
[  6.90500000e-02   0.00000000e+00   2.18000000e+00   0.00000000e+00
   4.58000000e-01   7.14700000e+00   5.42000000e+01   6.06220000e+00
   3.00000000e+00   2.22000000e+02   1.87000000e+01   3.96900000e+02
   5.33000000e+00] --> 36.2
print(len(boston))  # boston is a dict-like Bunch; len() counts its keys

print(X.shape)

print(y.shape)
4
(506, 13)
(506,)

Examine the correlation between the features and the label

Here matplotlib scatter plots give a quick visual read on each relationship

import matplotlib.pyplot as plt

X_labels = boston.feature_names

for i, name in enumerate(X_labels):
    X_i = X[:, i]        # all samples, feature i
    plt.figure(i + 1)
    plt.scatter(X_i, y)  # feature value vs. median home price
    plt.xlabel(name)
    plt.ylabel('MEDV')
    plt.title(name)
plt.show()

The plots reveal several clear patterns (checked numerically below):

  1. The higher the crime rate, the lower the price

  2. Houses bordering the river command higher prices

  3. The more rooms per dwelling on average, the higher the price

  4. The larger the share of lower-status residents, the lower the price
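
These eyeballed trends are easy to confirm numerically. A minimal sketch using np.corrcoef to print the Pearson correlation of each feature with MEDV (CRIM and LSTAT should come out negative, RM strongly positive):

# Pearson correlation between each feature and the target
for i, name in enumerate(boston.feature_names):
    r = np.corrcoef(X[:, i], y)[0, 1]
    print('{:8s} {:+.3f}'.format(name, r))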

Split the data into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=13)

print("Train test split success!")
Train test split success!
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error

from sklearn import preprocessing

import matplotlib.pyplot as plt

# X_poly = preprocessing.PolynomialFeatures().fit_transform(X)  # optional polynomial expansion; see the sketch further below

methods = [
    linear_model.LinearRegression(normalize=True),  # note: the normalize argument was removed in scikit-learn 1.2
#     linear_model.Ridge(),  # ridge regression; not suited to this underfitting scenario
    linear_model.Lasso(alpha=0.01),  # Lasso
    linear_model.LassoLars(alpha=0.1),
    linear_model.BayesianRidge(),
#     linear_model.Lars(),  # least-angle regression
#     linear_model.LogisticRegression(),
    DecisionTreeRegressor(),
    AdaBoostRegressor(n_estimators=200, learning_rate=0.01),  # AdaBoost; uses CART regression trees as the default base estimator
    GradientBoostingRegressor(alpha=0.01),
#     linear_model.PassiveAggressiveRegressor()
]

for method in methods:
    method.fit(X_train, y_train)
    y_pred = method.predict(X_test)
    print(str(method)[:20], '...', method.score(X_test, y_test), mean_squared_error(y_test, y_pred))
    for i in range(5):
        print(y_test[i], '-->', y_pred[i])
    plt.figure()
    plt.scatter(y_test, y_pred)                         # predicted vs. actual
    plt.plot(y_test, y_test, color='red', linewidth=3)  # y = x reference line
    plt.show()
#     y_pred_all = method.predict(X)
#     print(str(method)[:20], '_all...', method.score(X, y), mean_squared_error(y, y_pred_all))
LinearRegression(cop ... 0.731266198822 24.3636130537
12.0 --> 11.2372813213
15.2 --> 19.6569354177
21.0 --> 20.7727945311
24.0 --> 30.0178384426
19.4 --> 23.3468588414

[scatter plot: predicted vs. actual MEDV]

Lasso(alpha=0.01, co ... 0.727369322035 24.7169068996
12.0 --> 10.997479607
15.2 --> 19.8015428984
21.0 --> 20.9103040897
24.0 --> 30.1566779847
19.4 --> 23.3037725316

[scatter plot: predicted vs. actual MEDV]

LassoLars(alpha=0.1, ... 0.619695143322 34.4787307361
12.0 --> 13.5661711371
15.2 --> 21.0430044569
21.0 --> 23.4970861934
24.0 --> 27.8287636275
19.4 --> 22.2497974399

[scatter plot: predicted vs. actual MEDV]

BayesianRidge(alpha_ ... 0.70544502235 26.7045807674
12.0 --> 10.3868573593
15.2 --> 20.0949244784
21.0 --> 21.1871783596
24.0 --> 30.7223799719
19.4 --> 23.332064073

[scatter plot: predicted vs. actual MEDV]

DecisionTreeRegresso ... 0.835725689633 14.8932352941
12.0 --> 14.6
15.2 --> 16.1
21.0 --> 27.5
24.0 --> 32.0
19.4 --> 19.2

[scatter plot: predicted vs. actual MEDV]

AdaBoostRegressor(ba ... 0.850469334357 13.5565651266
12.0 --> 12.168852459
15.2 --> 16.48
21.0 --> 22.2409326425
24.0 --> 26.2711111111
19.4 --> 21.6185897436

[scatter plot: predicted vs. actual MEDV]

GradientBoostingRegr ... 0.908090516725 8.33258442609
12.0 --> 12.8209885964
15.2 --> 15.9543753266
21.0 --> 21.0684276931
24.0 --> 29.7570140491
19.4 --> 20.9639472861

[scatter plot: predicted vs. actual MEDV]

GBDT and AdaBoost clearly deliver the best regression performance of the models tried. Both ensembles are randomized, so scores drift slightly between runs; a repeat run gave:

GradientBoostingRegr ... 0.904900131591 8.6218271956

AdaBoostRegressor(ba ... 0.85113787301 13.4959548983
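
The commented-out PolynomialFeatures line earlier hints at one more direction: expanding the features before a linear fit. A minimal sketch of that idea as a pipeline (the pipeline itself is my addition, not part of the original notebook):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 feature expansion followed by ordinary least squares
poly_reg = make_pipeline(PolynomialFeatures(degree=2), linear_model.LinearRegression())
poly_reg.fit(X_train, y_train)
print('poly2 + linear R^2:', poly_reg.score(X_test, y_test))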

Import r2_score and wrap it in a scoring function

from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Compute and return the R^2 score of the predictions against the true values."""
    score = r2_score(y_true, y_predict)
    return score
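
A quick sanity check of the metric with made-up numbers (illustrative values only, not from the dataset):

print(performance_metric([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # imperfect fit, roughly 0.95
print(performance_metric([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # a perfect prediction scores 1.0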
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

def fit_model(X, y):

    """ Use grid search on the input data [X, y] to find the optimal decision tree model """

    cross_validator = KFold(n_splits=10, shuffle=False, random_state=None)

    regressor = DecisionTreeRegressor()

    params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

    scoring_fnc = make_scorer(performance_metric)

    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cross_validator)

    # Run the grid search on the input data [X, y]
    grid = grid.fit(X, y)
    print("best param" + str(grid.best_params_))
    print("best score" + str(grid.best_score_))
    # Return the best model found by the grid search
    return grid.best_estimator_ 

# Find the optimal model on the training data
optimal_reg = fit_model(X_train, y_train)

# Report the 'max_depth' parameter of the optimal model
print("Parameter 'max_depth' is {} for the optimal model.".format(optimal_reg.get_params()['max_depth']))
best param{'max_depth': 9}
best score0.746367148236
Parameter 'max_depth' is 9 for the optimal model.
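
Before moving on, it is worth checking how the tuned tree generalizes; a short sketch scoring it on the held-out test split from earlier:

y_opt_pred = optimal_reg.predict(X_test)
print('test R^2:', performance_metric(y_test, y_opt_pred))
print('test MSE:', mean_squared_error(y_test, y_opt_pred))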

Plot learning curves

To see the overall downward trend of the training error as data grows, and to observe how capping max_depth acts as regularization

from sklearn.model_selection import learning_curve, validation_curve
from sklearn.model_selection import ShuffleSplit

def ModelLearning(X, y):
    """ Plot learning curves for decision tree regressors of several depths. """
    cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0)
    train_sizes = np.rint(np.linspace(1, X.shape[0]*0.8 - 1, 9)).astype(int)
    fig = plt.figure(figsize=(10,7))

    for k, depth in enumerate([1,3,6,10]):
        regressor = DecisionTreeRegressor(max_depth = depth)
        sizes, train_scores, valid_scores = learning_curve(regressor, X, y, \
            cv = cv, train_sizes = train_sizes, scoring = 'r2')
        train_std = np.std(train_scores, axis = 1)
        train_mean = np.mean(train_scores, axis = 1)
        valid_std = np.std(valid_scores, axis = 1)
        valid_mean = np.mean(valid_scores, axis = 1)
        ax = fig.add_subplot(2, 2, k+1)
        ax.plot(sizes, train_mean, 'o-', color = 'r', label = 'Training Score')
        ax.plot(sizes, valid_mean, 'o-', color = 'g', label = 'Validation Score')
        ax.fill_between(sizes, train_mean - train_std, \
            train_mean + train_std, alpha = 0.15, color = 'r')
        ax.fill_between(sizes, valid_mean - valid_std, \
            valid_mean + valid_std, alpha = 0.15, color = 'g')

        # Labels
        ax.set_title('max_depth = %s'%(depth))
        ax.set_xlabel('Number of Training Points')
        ax.set_ylabel('r2_score')
        ax.set_xlim([0, X.shape[0]*0.8])
        ax.set_ylim([-0.05, 1.05])

    ax.legend(bbox_to_anchor=(1.05, 2.05), loc='lower left', borderaxespad = 0.)
    fig.suptitle('Decision Tree Regressor Learning Performances', fontsize = 16, y = 1.03)
    fig.tight_layout()
    plt.show()  # fig.show() warns and does nothing under non-GUI backends; plt.show() is portable


def ModelComplexity(X, y):
    """ Calculates the performance of the model as model complexity increases.
        The training and validation scores are then plotted. """

    # Create 10 cross-validation sets for training and testing
    cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0)

    # Vary the max_depth parameter from 1 to 10
    max_depth = np.arange(1,11)

    # Calculate the training and testing scores
    train_scores, valid_scores = validation_curve(DecisionTreeRegressor(), X, y, \
        param_name = "max_depth", param_range = max_depth, cv = cv, scoring = 'r2')

    # Find the mean and standard deviation for smoothing
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    valid_mean = np.mean(valid_scores, axis=1)
    valid_std = np.std(valid_scores, axis=1)

    # Plot the validation curve
    plt.figure(figsize=(7, 5))
    plt.title('Decision Tree Regressor Complexity Performance')
    plt.plot(max_depth, train_mean, 'o-', color = 'r', label = 'Training Score')
    plt.plot(max_depth, valid_mean, 'o-', color = 'g', label = 'Validation Score')
    plt.fill_between(max_depth, train_mean - train_std, \
        train_mean + train_std, alpha = 0.15, color = 'r')
    plt.fill_between(max_depth, valid_mean - valid_std, \
        valid_mean + valid_std, alpha = 0.15, color = 'g')

    # Visual aesthetics
    plt.legend(loc = 'lower right')
    plt.xlabel('Maximum Depth')
    plt.ylabel('r2_score')
    plt.ylim([-0.05,1.05])
    plt.show()


def PredictTrials(X, y, fitter, data):
    """ Performs trials of fitting and predicting data. """

    # Store the predicted prices
    prices = []

    for k in range(10):
        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(X, y, \
            test_size = 0.2, random_state = k)

        # Fit the data
        reg = fitter(X_train, y_train)

        # Make a prediction
        pred = reg.predict([data[0]])[0]
        prices.append(pred)

        # Result
        print("Trial {}: ${:,.2f}".format(k+1, pred))

    # Display price range
    print("\nRange in prices: ${:,.2f}".format(max(prices) - min(prices)))


ModelLearning(X_train, y_train)
ModelComplexity(X_train, y_train)

[figure: Decision Tree Regressor Learning Performances, four panels for max_depth = 1, 3, 6, 10]

[figure: Decision Tree Regressor Complexity Performance, training vs. validation score against max depth]
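
Finally, PredictTrials is defined above but never invoked. A hypothetical call, reusing fit_model as the fitter and the first sample as a stand-in client (client_data is illustrative, not from the original):

client_data = [X[0]]  # any 13-value feature vector in the dataset's column order works
PredictTrials(X, y, fit_model, client_data)

Note that MEDV is recorded in units of $1000, so the dollar formatting inside PredictTrials understates actual prices by a factor of 1000.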