boston
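First, load the data from sklearn. (`load_boston` was removed in scikit-learn 1.2, so this notebook requires an older release.)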
from sklearn.datasets import load_boston
import numpy as np
boston = load_boston()
boston.DESCR
Notes
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset. http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
X = boston.data
y = boston.target
for i in range(5):
    print(X[i], '-->', y[i])
[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 0.00000000e+00
5.38000000e-01 6.57500000e+00 6.52000000e+01 4.09000000e+00
1.00000000e+00 2.96000000e+02 1.53000000e+01 3.96900000e+02
4.98000000e+00] --> 24.0
[ 2.73100000e-02 0.00000000e+00 7.07000000e+00 0.00000000e+00
4.69000000e-01 6.42100000e+00 7.89000000e+01 4.96710000e+00
2.00000000e+00 2.42000000e+02 1.78000000e+01 3.96900000e+02
9.14000000e+00] --> 21.6
[ 2.72900000e-02 0.00000000e+00 7.07000000e+00 0.00000000e+00
4.69000000e-01 7.18500000e+00 6.11000000e+01 4.96710000e+00
2.00000000e+00 2.42000000e+02 1.78000000e+01 3.92830000e+02
4.03000000e+00] --> 34.7
[ 3.23700000e-02 0.00000000e+00 2.18000000e+00 0.00000000e+00
4.58000000e-01 6.99800000e+00 4.58000000e+01 6.06220000e+00
3.00000000e+00 2.22000000e+02 1.87000000e+01 3.94630000e+02
2.94000000e+00] --> 33.4
[ 6.90500000e-02 0.00000000e+00 2.18000000e+00 0.00000000e+00
4.58000000e-01 7.14700000e+00 5.42000000e+01 6.06220000e+00
3.00000000e+00 2.22000000e+02 1.87000000e+01 3.96900000e+02
5.33000000e+00] --> 36.2
print(len(boston))
print(X.shape)
print(y.shape)
4
(506, 13)
(506,)
Examine the correlation between each feature and the label
Here matplotlib is used to visualize the correlation
import matplotlib.pyplot as plt
X_labels = boston.feature_names
for i, name in enumerate(X_labels):
    X_i = X[:, i]
    plt.figure(i + 1)
    plt.scatter(X_i, y)
    plt.xlabel(name)
    plt.ylabel('MEDV')
    plt.title(name)
plt.show()
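A few clear patterns emerge from the plots:
The higher the crime rate, the lower the price
Houses bordering the river have a higher minimum price
The more rooms per dwelling on average, the higher the price
The larger the share of lower-status (low-income) residents, the lower the price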
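The visual impressions above can be cross-checked numerically. The following is a minimal sketch, assuming that Pearson correlation is an adequate summary of each scatter plot. The helper name `feature_target_correlations` and the tiny synthetic arrays are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def feature_target_correlations(X, y, names):
    """Return {feature name: Pearson correlation with y} (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for i, name in enumerate(names):
        # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r(X_i, y)
        result[str(name)] = float(np.corrcoef(X[:, i], y)[0, 1])
    return result

# Synthetic stand-in data (assumption): the first column rises with y, the second falls
X_demo = np.array([[1.0, 9.0], [2.0, 7.0], [3.0, 6.0], [4.0, 2.0]])
y_demo = np.array([10.0, 20.0, 30.0, 40.0])
corr = feature_target_correlations(X_demo, y_demo, ["up", "down"])
for name, r in corr.items():
    print(name, round(r, 3))
```

With the notebook's data the call would be `feature_target_correlations(X, y, boston.feature_names)`; a negative value for a feature means the price tends to fall as that feature grows.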
Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=13)
print("Train test split success!")
Train test split success!
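To make the split sizes concrete, here is a rough sketch of what a shuffled 80/20 split does to 506 rows. It is not scikit-learn's implementation: `simple_split` is a hypothetical helper, its random permutation will differ from the one `train_test_split` draws, and rounding the test share up is an assumption of this sketch.

```python
import math
import numpy as np

def simple_split(n_samples, test_size=0.2, seed=13):
    """Return (train_idx, test_idx) from a shuffled index range (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    n_test = math.ceil(n_samples * test_size)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(506)
print(len(train_idx), len(test_idx))
```

The two index arrays are disjoint and together cover every row, which is the property that keeps test rows out of model fitting.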
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
import matplotlib.pyplot as plt
# features_new2 = preprocessing.PolynomialFeatures().fit_transform(features)
# Note: the `normalize` argument was removed in scikit-learn 1.2; this cell targets an older release.
methods = [
    linear_model.LinearRegression(normalize=True),
    # linear_model.Ridge(),  # ridge regression; not suited to this underfitting scenario
    linear_model.Lasso(alpha=0.01),  # lasso
    linear_model.LassoLars(alpha=.1),
    linear_model.BayesianRidge(),
    # linear_model.Lars(),  # least-angle regression
    # linear_model.LogisticRegression(),
    DecisionTreeRegressor(),
    AdaBoostRegressor(n_estimators=200, learning_rate=0.01),  # AdaBoost uses a CART regression tree by default
    GradientBoostingRegressor(alpha=0.01),
    # linear_model.PassiveAggressiveRegressor()
]
for method in methods:
    method.fit(X_train, y_train)
    y_pred = method.predict(X_test)
    y_pred_all = method.predict(X)
    # Print the truncated model name, the R^2 score and the MSE on the test set
    print(str(method)[:20], '...', method.score(X_test, y_test), mean_squared_error(y_test, y_pred))
    # Show the first five true values next to their predictions
    for i in range(5):
        print(y_test[i], '-->', y_pred[i])
    # Scatter predictions against true values; the red line marks a perfect prediction
    plt.figure()
    plt.scatter(y_test, y_pred)
    plt.plot(y_test, y_test, color='red', linewidth=3)
    plt.show()
    # print(str(method)[:20], '_all...', method.score(features, label), mean_squared_error(label, y_pred_all))
LinearRegression(cop ... 0.731266198822 24.3636130537
12.0 --> 11.2372813213
15.2 --> 19.6569354177
21.0 --> 20.7727945311
24.0 --> 30.0178384426
19.4 --> 23.3468588414
Lasso(alpha=0.01, co ... 0.727369322035 24.7169068996
12.0 --> 10.997479607
15.2 --> 19.8015428984
21.0 --> 20.9103040897
24.0 --> 30.1566779847
19.4 --> 23.3037725316
LassoLars(alpha=0.1, ... 0.619695143322 34.4787307361
12.0 --> 13.5661711371
15.2 --> 21.0430044569
21.0 --> 23.4970861934
24.0 --> 27.8287636275
19.4 --> 22.2497974399
BayesianRidge(alpha_ ... 0.70544502235 26.7045807674
12.0 --> 10.3868573593
15.2 --> 20.0949244784
21.0 --> 21.1871783596
24.0 --> 30.7223799719
19.4 --> 23.332064073
DecisionTreeRegresso ... 0.835725689633 14.8932352941
12.0 --> 14.6
15.2 --> 16.1
21.0 --> 27.5
24.0 --> 32.0
19.4 --> 19.2
AdaBoostRegressor(ba ... 0.850469334357 13.5565651266
12.0 --> 12.168852459
15.2 --> 16.48
21.0 --> 22.2409326425
24.0 --> 26.2711111111
19.4 --> 21.6185897436
GradientBoostingRegr ... 0.908090516725 8.33258442609
12.0 --> 12.8209885964
15.2 --> 15.9543753266
21.0 --> 21.0684276931
24.0 --> 29.7570140491
19.4 --> 20.9639472861
As the results show, GBDT and AdaBoost both deliver relatively good regression performance.
GradientBoostingRegr ... 0.904900131591 8.6218271956
AdaBoostRegressor(ba ... 0.85113787301 13.4959548983
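The two numbers printed for each model are the R² score (from `method.score`) and the mean squared error. The sketch below shows how the two relate, using only NumPy. The function names `mse` and `r2` are illustrative, and the example arrays are the five true values and linear-regression predictions printed above, rounded to two decimals.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, i.e. 1 - MSE / (population variance of y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    return 1.0 - mse(y_true, y_pred) / float(np.var(y_true))

y_true_demo = [12.0, 15.2, 21.0, 24.0, 19.4]
y_pred_demo = [11.24, 19.66, 20.77, 30.02, 23.35]
print(mse(y_true_demo, y_pred_demo), r2(y_true_demo, y_pred_demo))
```

Because R² divides the error by the variance of the true values, a model that always predicts the mean scores 0 and a perfect model scores 1, which makes R² easier to compare across data sets than raw MSE.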
Import r2_score
from sklearn.metrics import r2_score
def performance_metric(y_true, y_predict):
    """Compute and return the R^2 score of the predictions against the true values."""
    score = r2_score(y_true, y_predict)
    return score
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
def fit_model(X, y):
    """Use grid search to find the best decision tree model for the input data [X, y]."""
    # 10-fold cross-validation without shuffling
    cross_validator = KFold(n_splits=10, shuffle=False)
    regressor = DecisionTreeRegressor()
    # Candidate tree depths 1 through 10
    params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cross_validator)
    # Run the grid search on the input data [X, y]
    grid = grid.fit(X, y)
    print("best param" + str(grid.best_params_))
    print("best score" + str(grid.best_score_))
    # Return the best model found by the grid search
    return grid.best_estimator_

# Fit on the training data to obtain the best model
optimal_reg = fit_model(X_train, y_train)
# Print the 'max_depth' parameter of the best model
print("Parameter 'max_depth' is {} for the optimal model.".format(optimal_reg.get_params()['max_depth']))
best param{'max_depth': 9}
best score0.746367148236
Parameter 'max_depth' is 9 for the optimal model.
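The grid search above scores each `max_depth` with 10-fold cross-validation on the training set. As a sketch of how unshuffled k-fold partitions the rows, the hypothetical helper `kfold_slices` below hands out contiguous blocks and gives the first `n % k` folds one extra sample. This follows the behaviour described in the scikit-learn `KFold` documentation, but it is an illustration rather than the library's code, and the 404 rows are an assumption (80% of 506).

```python
def kfold_slices(n_samples, n_splits):
    """Yield (start, stop) index bounds of each validation fold, in order."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds absorb the remainder, one sample each
        size = base + (1 if fold < extra else 0)
        yield start, start + size
        start += size

folds = list(kfold_slices(404, 10))  # assumption: 404 training rows
print([stop - start for start, stop in folds])
```

Each fold serves once as the validation block while the other nine are used for fitting, so every training row is validated on exactly once per candidate depth.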
Plot learning curves
Understand the overall downward trend of the training error and observe the effect of regularization
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.model_selection import ShuffleSplit
def ModelLearning(X, y):
    # Create 10 shuffled cross-validation splits, each holding out 20% of the rows
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    # Nine training-set sizes, from 1 row up to just under 80% of the data
    train_sizes = np.rint(np.linspace(1, X.shape[0] * 0.8 - 1, 9)).astype(int)
    fig = plt.figure(figsize=(10, 7))
    # One subplot per tree depth
    for k, depth in enumerate([1, 3, 6, 10]):
        regressor = DecisionTreeRegressor(max_depth=depth)
        sizes, train_scores, valid_scores = learning_curve(
            regressor, X, y, cv=cv, train_sizes=train_sizes, scoring='r2')
        train_std = np.std(train_scores, axis=1)
        train_mean = np.mean(train_scores, axis=1)
        valid_std = np.std(valid_scores, axis=1)
        valid_mean = np.mean(valid_scores, axis=1)
        ax = fig.add_subplot(2, 2, k + 1)
        ax.plot(sizes, train_mean, 'o-', color='r', label='Training Score')
        ax.plot(sizes, valid_mean, 'o-', color='g', label='Validation Score')
        ax.fill_between(sizes, train_mean - train_std,
                        train_mean + train_std, alpha=0.15, color='r')
        ax.fill_between(sizes, valid_mean - valid_std,
                        valid_mean + valid_std, alpha=0.15, color='g')
        # Labels
        ax.set_title('max_depth = %s' % (depth))
        ax.set_xlabel('Number of Training Points')
        ax.set_ylabel('r2_score')
        ax.set_xlim([0, X.shape[0] * 0.8])
        ax.set_ylim([-0.05, 1.05])
    ax.legend(bbox_to_anchor=(1.05, 2.05), loc='lower left', borderaxespad=0.)
    fig.suptitle('Decision Tree Regressor Learning Performances', fontsize=16, y=1.03)
    fig.tight_layout()
    fig.show()
def ModelComplexity(X, y):
    """ Calculates the performance of the model as model complexity increases.
        The training and validation scores are then plotted. """
    # Create 10 cross-validation sets for training and testing
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
    # Vary the max_depth parameter from 1 to 10
    max_depth = np.arange(1, 11)
    # Calculate the training and validation scores
    train_scores, valid_scores = validation_curve(
        DecisionTreeRegressor(), X, y,
        param_name="max_depth", param_range=max_depth, cv=cv, scoring='r2')
    # Find the mean and standard deviation for smoothing
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    valid_mean = np.mean(valid_scores, axis=1)
    valid_std = np.std(valid_scores, axis=1)
    # Plot the validation curve
    plt.figure(figsize=(7, 5))
    plt.title('Decision Tree Regressor Complexity Performance')
    plt.plot(max_depth, train_mean, 'o-', color='r', label='Training Score')
    plt.plot(max_depth, valid_mean, 'o-', color='g', label='Validation Score')
    plt.fill_between(max_depth, train_mean - train_std,
                     train_mean + train_std, alpha=0.15, color='r')
    plt.fill_between(max_depth, valid_mean - valid_std,
                     valid_mean + valid_std, alpha=0.15, color='g')
    # Visual aesthetics
    plt.legend(loc='lower right')
    plt.xlabel('Maximum Depth')
    plt.ylabel('r2_score')
    plt.ylim([-0.05, 1.05])
    plt.show()
def PredictTrials(X, y, fitter, data):
    """ Performs trials of fitting and predicting data. """
    # Store the predicted prices
    prices = []
    for k in range(10):
        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=k)
        # Fit the data
        reg = fitter(X_train, y_train)
        # Make a prediction
        pred = reg.predict([data[0]])[0]
        prices.append(pred)
        # Result
        print("Trial {}: ${:,.2f}".format(k + 1, pred))
    # Display price range
    print("\nRange in prices: ${:,.2f}".format(max(prices) - min(prices)))
ModelLearning(X_train, y_train)
ModelComplexity(X_train, y_train)
D:\Program Files\Anaconda3\lib\site-packages\matplotlib\figure.py:418: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
"matplotlib is currently using a non-GUI backend, "
<matplotlib.figure.Figure at 0x1aa7eeca080>
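`ModelLearning` evaluates each tree at nine training-set sizes. The snippet below recomputes those sizes outside the function so they can be inspected directly. It assumes 404 training rows (80% of the 506 samples), and the variable name `n_rows` is illustrative.

```python
import numpy as np

n_rows = 404  # assumption: rows in X_train after the 80/20 split of 506 samples
# Each ShuffleSplit iteration keeps 80% of the rows for fitting,
# so the largest size is capped just below n_rows * 0.8
train_sizes = np.rint(np.linspace(1, n_rows * 0.8 - 1, 9)).astype(int)
print(train_sizes)
```

Starting at a single row explains why the leftmost points of each learning curve show a perfect training score together with a very poor validation score.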