## Dataset: Boston housing

*First project of Udacity Machine Learning Nanodegree*

A description of the dataset can be found here. This dataset concerns housing values in suburbs of Boston. There is **506 instances** of **14 attributes** each in the dataset. Generally, this dataset is suitable for regression task. Attributes in the datasets:

- CRIM: Per capita crime rate by town.
- ZN: Proportion of residental land zoned for lots over 25,000 sq.ft.
- INDUS: Proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- NOX: Nitric oxides concentration (parts per 10 million).
- RM: Average number of rooms per dwelling.
- AGE: Proportion of owner-occupied units built prior to 1940.
- DIS: Weighted distances to five Boston employment centres.
- RAD: Index of accessibility to radial highways.
- TAX: Full-value property-tax rate per $10,000.
- PTRATIO: pupil-teacher ratio by town.
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- LSTAT: % lower status of the population.
- MEDV: Median value of owner-occupied homes in $1000’s (usually the target).

The Boston housing dataset is shipped with `scikit-learn`

. The same description of the data as above can be obtained from `scikit-learn.datasets.load_boston().DESCR`

.

## Requirement

This project requires `Python-3.5.2`

, `jupyter-1.0.0`

, `numpy-1.11.2`

, `scikit-learn-0.18.1`

, and `matplotlib-1.5.3`

installed. I recommend to use Anaconda to manage Python virtual environments and packages.

First, we import necessary packages:

`numpy`

for numeric computations.`matplotlib.pyplot`

for visualization. (inline means the figures are shown in the notebook)`sklearn`

for boston housing dataset and decision tree model.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

# Importing a few necessary libraries
import numpy as np
import matplotlib.pyplot as pl
import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
# Make matplotlib show plots inline
%matplotlib inline
# Create our client's feature set for
# which we will be predicting a selling price
CLIENT_FEATURES = [[11.95, 0.00, 18.100, 0, 0.6590, 5.6090, 90.00, \
1.385, 24, 680.0, 20.20, 332.09, 12.13]]
# Load the Boston Housing dataset into the city_data variable
city_data = datasets.load_boston()
# Initialize the housing prices and housing features
housing_prices = city_data.target
housing_features = city_data.data
print("Boston Housing dataset loaded successfully!")

```
Boston Housing dataset loaded successfully!
```

1

city_data.feature_names

```
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'],
dtype='<U7')
```

I would like to see the data in a nice table format, so I load the data into a `pandas.DataFrame`

and printed the first five rows with `.head()`

.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

pdict = {'CRIM': city_data.data[:,0],
'ZN': city_data.data[:,1],
'INDUS': city_data.data[:,2],
'CHAS': city_data.data[:,3],
'NOX': city_data.data[:,4],
'RM': city_data.data[:,5],
'AGE': city_data.data[:,6],
'DIS': city_data.data[:,7],
'RAD': city_data.data[:,8],
'TAX': city_data.data[:,9],
'PTRATIO': city_data.data[:,10],
'B': city_data.data[:,11],
'LSTAT': city_data.data[:,12],
'MEDV': city_data.target[:]}
ptable = pd.DataFrame(pdict)

1

ptable.head()

AGE | B | CHAS | CRIM | DIS | INDUS | LSTAT | MEDV | NOX | PTRATIO | RAD | RM | TAX | ZN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 65.2 | 396.90 | 0.0 | 0.00632 | 4.0900 | 2.31 | 4.98 | 24.0 | 0.538 | 15.3 | 1.0 | 6.575 | 296.0 | 18.0 |

1 | 78.9 | 396.90 | 0.0 | 0.02731 | 4.9671 | 7.07 | 9.14 | 21.6 | 0.469 | 17.8 | 2.0 | 6.421 | 242.0 | 0.0 |

2 | 61.1 | 392.83 | 0.0 | 0.02729 | 4.9671 | 7.07 | 4.03 | 34.7 | 0.469 | 17.8 | 2.0 | 7.185 | 242.0 | 0.0 |

3 | 45.8 | 394.63 | 0.0 | 0.03237 | 6.0622 | 2.18 | 2.94 | 33.4 | 0.458 | 18.7 | 3.0 | 6.998 | 222.0 | 0.0 |

4 | 54.2 | 396.90 | 0.0 | 0.06905 | 6.0622 | 2.18 | 5.33 | 36.2 | 0.458 | 18.7 | 3.0 | 7.147 | 222.0 | 0.0 |

## Statistical Analysis and Data Exploration

Let’s quickly investigate a few basic statistics about the dataset and look at the `CLIENT_FEATURES`

to see how the data relates to it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

# Number of houses and features in the dataset
total_houses, total_features = city_data.data.shape
# Minimum housing value in the dataset
minimum_price = housing_prices.min()
# Maximum housing value in the dataset
maximum_price = housing_prices.max()
# Mean house value of the dataset
mean_price = housing_prices.mean()
# Median house value of the dataset
median_price = np.median(housing_prices)
# Standard deviation of housing values of the dataset
std_dev = housing_prices.std()
# Show the calculated statistics
print("Boston Housing dataset statistics (in $1000's):\n")
print("Total number of houses:", total_houses)
print("Total number of features:", total_features)
print("Minimum house price:", minimum_price)
print("Maximum house price:", maximum_price)
print("Mean house price: {0:.3f}".format(mean_price))
print("Median house price:", median_price)
print("Standard deviation of house price: {0:.3f}".format(std_dev))

```
Boston Housing dataset statistics (in $1000's):
Total number of houses: 506
Total number of features: 13
Minimum house price: 5.0
Maximum house price: 50.0
Mean house price: 22.533
Median house price: 21.2
Standard deviation of house price: 9.188
```

By intuition, the top 3 deciding factors is crime rate (CRIM), proportion of blacks (B), and the accessibility to the highway (RAD).

**CRIM**: Area with low crime rate must have higher security, income, insurrance, and better life in general. Hence the price of houses must be affected by this factor.**B**: Many people might think that area with many blacks will have be not so safe. Therefore the price might be higher for residence with smaller blacks porpotion.**RAD**: The accessibility to the highway might also be desirable as it is more convenient to go to work.

Let’s examine our client. There features we selected have the index `0`

(CRIM), `8`

(RAD), and `11`

(B).

1

print(CLIENT_FEATURES)

```
[[11.95, 0.0, 18.1, 0, 0.659, 5.609, 90.0, 1.385, 24, 680.0, 20.2, 332.09, 12.13]]
```

1
2
3

print('Client CRIM = ' + str(CLIENT_FEATURES[0][0]))
print('Client RAD = ' + str(CLIENT_FEATURES[0][8]))
print('Client B = ' + str(CLIENT_FEATURES[0][11]))

```
Client CRIM = 11.95
Client RAD = 24
Client B = 332.09
```

Our client’s crime rate is quite high!

## Picking evaluation method

We first shuffle the data using `sklearn.utils.shuffle(*arrays, *options)`

. This function will return new shuffled data and target arrays. Then we split data 70-30 to use for training and testing using `train_test_split(...)`

.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
def shuffle_split_data(X, y):
"""
Shuffles and splits data into 70% training and 30% testing subsets,
then returns the training and testing subsets.
"""
# Shuffled data
X_s, y_s = shuffle(X, y, random_state=0)
# Split the data into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(X_s, y_s,
test_size=0.3,
random_state=0)
# Return the training and testing data subsets
return X_train, y_train, X_test, y_test
# Test shuffle_split_data
try:
X_train, y_train, X_test, y_test = shuffle_split_data(housing_features,
housing_prices)
print("Successfully shuffled and split the data!")
except:
print("Something went wrong with shuffling and splitting the data.")
print("Shape of training data: ", X_train.shape)
print("Shape of training target: ", y_train.shape)
print("Shape of testing data: ", X_test.shape)
print("Shape of testing target: ", y_test.shape)

```
Successfully shuffled and split the data!
Shape of training data: (354, 13)
Shape of training target: (354,)
Shape of testing data: (152, 13)
Shape of testing target: (152,)
```

Splitting the data for training and testing allows us to evaluate our model by looking at the performance on training and testing data. The learning curves for training and testing show us if the model is underfitting (bias) or overfitting (variation).

MSE or MAE are better choices for regression task. Metrics like accuracy, precision, recall, f1-score are often used for evaluating a classification problem.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import r2_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
def performance_metric(y_true, y_predict):
"""
Calculates and returns the total error between true
and predicted values
based on a performance metric chosen by the student.
"""
error = MAE(y_true, y_predict)
return error
# Test performance_metric
try:
total_error = performance_metric(y_train, y_train)
print("Successfully performed a metric calculation!")
except:
print("Something went wrong with performing a metric calculation.")

```
Successfully performed a metric calculation!
```

As mentioned before, mean squared error (MSE) and mean absolute error (MAE) are both appropriate for predicting housing prices. MAE is robust to outlier but it is not always differentible for grtadient methods. In contrast, MSE is always differentible but it weights the high loss (outlier) heavily. Since Boston dataset contains many outliers, where housing price is high for some special reason, MAE is the most appropriate error evaluation.

`fit_model`

performs grid search cross validation and return the best estimator. GridSearchCV is the object provided by `scikit-learn`

to search for the best paratemeters using cross-validation and then return the best estimator. To use `GridSearchCV`

, we need to pass the `estimator`

, the dictionary containing the parameter grid `param_grid`

, the scrorer callable object `scoring`

, and optionally the number of cross-validation fold `cv`

. Note that when we use error metrics (MAE, MSE, etc.), we need to specify `greater_is_better = False`

in `make_scorer`

function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
def fit_model(X, y):
"""
Tunes a decision tree regressor
model using GridSearchCV on the input data X
and target labels y and returns this optimal model.
"""
# Create a decision tree regressor object
regressor = DecisionTreeRegressor()
# Set up the parameters we wish to tune
parameters = {'max_depth':(1,2,3,4,5,6,7,8,9,10)}
# Make an appropriate scoring function
scoring_function = make_scorer(performance_metric,
greater_is_better=False)
# Make the GridSearchCV object
reg = GridSearchCV(regressor, parameters,
scoring_function)
# Fit the learner to the data to obtain the optimal
# model with tuned parameters
reg.fit(X, y)
# Return the optimal model
return reg.best_estimator_
# Test fit_model on entire dataset
try:
reg = fit_model(housing_features, housing_prices)
print("Successfully fit a model!")
except:
print("Something went wrong with fitting a model.")

```
Successfully fit a model!
```

Grid search algorithm is a brute-force hyper-parameter search for the best estimator configuration. Grid search algorithm is applicable when we need to find the best parameters for our learning model. This algorithm searches through all possible hyper-parameter configurations, evaluates the error of each configuration, then returns the best one. The exhaustive search guarantee the best model configuration is returned. However, due to the nature of a brute force algorithm, grid search might not be suitable for models with a large number of hyper-parameters or the hyper-parameters have large search spaces.

Cross-validation is a data reuse technique to maximize the usage of data for training and testing. Specifying an integer `k`

in advanced, for each model, cross-validation scheme splits the given training data into k-fold, runs the training procedure k times with k-1 folds of data as training and the remaining 1 fold as testing data. The final error is then averaged for k folds. Based on this cross-validation error, we can evaluate our model for overfitting. In contrast, if we evaluate the error of our model on the training dataset, there is a high chance that the learning algorithm will overfit the data (it can just remember the exact input-output without generalizing the data). In grid search, each model’s configuration might have different performance on the training dataset. Without cross-validation, the grid search algorithm might select the model configuration that best *overfits* the data. On the other hand, with cross-validation, grid search can account for variation in the model’s prediction and prevent overfitting.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

def learning_curves(X_train, y_train, X_test, y_test):
"""
Calculates the performance of several models with
varying sizes of training data. The learning and testing
error rates for each model are then plotted.
"""
print("Creating learning curve graphs for max_depths of 1, 3, 6, and 10. . .")
# Create the figure window
fig = pl.figure(figsize=(10,8))
# We will vary the training set size so that
# we have 50 different sizes
sizes = np.rint(np.linspace(1, len(X_train), 50)).astype(int)
train_err = np.zeros(len(sizes))
test_err = np.zeros(len(sizes))
# Create four different models based on max_depth
for k, depth in enumerate([1,3,6,10]):
for i, s in enumerate(sizes):
# Setup a decision tree regressor so that
# it learns a tree with max_depth = depth
regressor = DecisionTreeRegressor(max_depth = depth)
# Fit the learner to the training data
regressor.fit(X_train[:s], y_train[:s])
# Find the performance on the training set
train_err[i] = performance_metric(y_train[:s],
regressor.predict(X_train[:s]))
# Find the performance on the testing set
test_err[i] = performance_metric(y_test,
regressor.predict(X_test))
# Subplot the learning curve graph
ax = fig.add_subplot(2, 2, k+1)
ax.plot(sizes, test_err, lw = 2, label = 'Testing Error')
ax.plot(sizes, train_err, lw = 2, label = 'Training Error')
ax.legend()
ax.set_title('max_depth = %s'%(depth))
ax.set_xlabel('Number of Data Points in Training Set')
ax.set_ylabel('Total Error')
ax.set_xlim([0, len(X_train)])
# Visual aesthetics
fig.suptitle('Decision Tree Regressor Learning Performances',
fontsize=18, y=1.03)
fig.tight_layout()
pl.show()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40

def model_complexity(X_train, y_train, X_test, y_test):
"""
Calculates the performance of the model
as model complexity increases. The learning and
testing errors rates are then plotted.
"""
print("Creating a model complexity graph. . . ")
# We will vary the max_depth of a decision tree
# model from 1 to 14
max_depth = np.arange(1, 14)
train_err = np.zeros(len(max_depth))
test_err = np.zeros(len(max_depth))
for i, d in enumerate(max_depth):
# Setup a Decision Tree Regressor so that it learns
# a tree with depth d
regressor = DecisionTreeRegressor(max_depth = d)
# Fit the learner to the training data
regressor.fit(X_train, y_train)
# Find the performance on the training set
train_err[i] = performance_metric(y_train,
regressor.predict(X_train))
# Find the performance on the testing set
test_err[i] = performance_metric(y_test,
regressor.predict(X_test))
# Plot the model complexity graph
pl.figure(figsize=(7, 5))
pl.title('Decision Tree Regressor Complexity Performance')
pl.plot(max_depth, test_err, lw=2, label = 'Testing Error')
pl.plot(max_depth, train_err, lw=2, label = 'Training Error')
pl.legend()
pl.xlabel('Maximum Depth')
pl.ylabel('Total Error')
pl.show()

## Analyzing Model Performance

1

learning_curves(X_train, y_train, X_test, y_test)

```
Creating learning curve graphs for max_depths of 1, 3, 6, and 10. . .
```

The `learning_curves`

function trains the input data with 4 different max-depth values and then plot the learning curves as above. The `DecisionTreeRegressor`

models are trained on the training dataset with increasing data size (50 models in total). The error of prediction is then measured on the training data used (green line) and the testing data (blue line).

`max_depth = 6`

curve shows large variation as we increase the training data size. Throughout the training processes, the training error was small and it increased a little as the training size increased. At the same time, we can see the testing error had an overall downward trend, but no clear sign of convergence. Besides, the testing error has a large variation while the testing error is small hints that our model might overfit the training data.

1

model_complexity(X_train, y_train, X_test, y_test)

```
Creating a model complexity graph. . .
```

The figure above shows the model compexity curve. The function `model_complexity`

trains each learning models using the whole training data and then computes the error rate on training and testing error. When the `max_depth`

is 1, both testing and training error is high. In this case, the model could not generalize the data well, lead to high bias. On the other hand, when `max_depth`

is 10, we have a small training error, but large testing error. The model might overfitted the training data, hence high variance.

Based on the model complexity curve, we can say the best `max_depth`

value is probably within the range 4 to 8. Any value lower than 4 leads to high bias, while values larger than 8 lead to high variance.

## Model Prediction

In the previous code block, we have defined the function `fit_model`

in which the `max_depth`

is selected by `GridSearchCV`

. As expected from observing the model compexity curve above, this function yields `max_depth`

of 5 in MAE and MSE error metrics.

1
2
3
4
5
6
7

max_depths = []
for _ in range(100):
reg = fit_model(housing_features, housing_prices)
max_depths.append(reg.get_params()['max_depth'])
print("GridSearchCV max_depth result for DecisionTreeRegression model: ")
print("Median:", np.median(max_depths))
print("Mean:", np.mean(max_depths), ", Standard deviation:", np.std(max_depths))

```
GridSearchCV max_depth result for DecisionTreeRegression model:
Median: 5.0
Mean: 5.06 , Standard deviation: 1.1297787394
```

The code block below gives the prediction for our client:

1
2

sale_price = reg.predict(CLIENT_FEATURES)
print("Predicted value of client's home: {0:.3f}".format(sale_price[0]))

```
Predicted value of client's home: 20.766
```

Compared to the basic statistic, our client’s home value is well below the mean and the median, but it is within the standard deriviation. Based on the learning and testing curves above, we can see that this model is about $3,000 off in prediction on average. Compared to the $20,000 price range, model is quite accurate. However, in order to put this model into real use, the human prediction error should be compared with this model. If this model perform better or similar to human error, I will definitely use this model as the tool for clients. Otherwise, this model can only serve as an reference point estimation for human evaluation. Personally I doubt human prediction and believes in statistics more. As mentioned in the book “Thinking fast and slow” by Daniel Kahneman, human judgements are biased.