Car Evaluation Sample Project - Part 5

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Fifth Step: Model Evaluation and Refinement

Prepare the libraries and functions for plotting

In [138]:
# Core libraries used throughout this section
from ipywidgets import interact, interactive, fixed, interact_manual
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
In [139]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    # Overlay two kernel density estimates so the actual and predicted
    # price distributions can be compared visually.
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # sns.distplot is deprecated; kdeplot draws the same density curves
    ax1 = sns.kdeplot(RedFunction, color="r", label=RedName)
    sns.kdeplot(BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    plt.legend()

    plt.show()
    plt.close()
In [140]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    # xtrain, xtest:   training and testing data for the feature
    # y_train, y_test: training and testing data for the target
    # lr:              trained linear regression object
    # poly_transform:  fitted polynomial transformation object
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # evaluate the fitted function on an evenly spaced grid covering both sets
    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])
    x = np.arange(xmin, xmax, 0.1)

    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()

1. Training and Testing

First, we will place the target data "price" in a separate dataframe y_data, and remove that column from x_data:

In [141]:
y_data = df['price']
In [142]:
x_data=df.drop('price',axis=1)

Now, we will randomly split the data into training and testing sets:

In [143]:
from sklearn.model_selection import train_test_split

#The testing set is 10% of the total dataset
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
number of test samples : 20
number of training samples: 180

Let's create a linear regression object and fit the model using the feature "horsepower":

In [144]:
lre=LinearRegression()
In [145]:
lre.fit(x_train[['horsepower']], y_train)
Out[145]:
LinearRegression()

Let's calculate the R^2 on the test data:

In [146]:
lre.score(x_test[['horsepower']], y_test)
Out[146]:
0.5454534032667759

Let's calculate the R^2 on the train data:

In [147]:
lre.score(x_train[['horsepower']], y_train)
Out[147]:
0.6572826747147018

We can see that the R^2 is smaller on the test data than on the training data.

Cross-Validation Score

We input the model object (lre), the feature ("horsepower"), and the target data (y_data). The parameter 'cv' determines the number of folds; in this case, it is 4.

In [149]:
from sklearn.model_selection import cross_val_score
In [150]:
Rcross = cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)

The default scoring is R^2. Each element in the array is the R^2 value for one fold:

In [151]:
Rcross
Out[151]:
array([0.77474062, 0.5172957 , 0.74777703, 0.04701847])

We can calculate the average and standard deviation of our estimate:

In [152]:
print("The mean of the folds are", Rcross.mean(), "and the standard deviation is" , Rcross.std())
The mean of the folds are 0.5217079546458679 and the standard deviation is 0.2917543177373881

We can use negative mean squared error as the score by setting the 'scoring' parameter to 'neg_mean_squared_error'. Multiplying by -1 converts the scores back to positive MSE values:

In [153]:
-1 * cross_val_score(lre, x_data[['horsepower']], y_data, cv=4, scoring='neg_mean_squared_error')
Out[153]:
array([20648006.60031452, 43733821.19046848, 12543435.0168994 ,
       17587351.09090063])
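
These values are mean squared errors in squared dollars, which are hard to interpret directly. As an optional sanity check (a sketch, not one of the recorded cells), taking the element-wise square root converts each fold's MSE into an RMSE expressed in dollars:

# Optional sketch: convert each fold's cross-validated MSE into an RMSE,
# which is in the same units as 'price' and easier to interpret.
mse_folds = -1 * cross_val_score(lre, x_data[['horsepower']], y_data,
                                 cv=4, scoring='neg_mean_squared_error')
print("RMSE per fold:", np.sqrt(mse_folds))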

2. Overfitting, Underfitting and Model Selection

Let's create a multiple linear regression object and train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features.

In [154]:
lr = LinearRegression()
lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)
Out[154]:
LinearRegression()

Prediction using training data:

In [155]:
yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_train[0:5]
Out[155]:
array([ 7262.8602918 ,   765.64454918, 34524.77168752,  6561.23620348,
        6079.9335172 ])

Prediction using test data:

In [156]:
yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_test[0:5]
Out[156]:
array([11098.96538912, 10418.2608313 ,  6479.86875318, 23268.36603459,
        9494.7580886 ])

Let's perform some model evaluation using our training and testing data separately.

In [157]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Let's examine the distribution of the predicted values of the training data:

In [159]:
Title = 'Distribution Plot of Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)

So far, the model seems to be doing well at learning from the training dataset. But what happens when it encounters new data from the testing dataset? When the model generates predictions for the test data, we see that the distribution of the predicted values is much different from that of the actual target values.

In [160]:
Title = 'Distribution Plot of Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test, yhat_test, "Actual Values (Test)", "Predicted Values (Test)", Title)

Comparing Figure 1 and Figure 2, it is evident that the model's predictions fit the training data (Figure 1) much better than the test data (Figure 2). The difference in Figure 2 is most apparent in the range of $5,000 to $15,000, where the shape of the distribution is extremely different. Let's see if polynomial regression also exhibits a drop in prediction accuracy when analyzing the test dataset.

In [161]:
from sklearn.preprocessing import PolynomialFeatures

Let's create a degree 5 polynomial model and use 55 percent of the data for training and the rest for testing:

In [162]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)

We will perform a degree 5 polynomial transformation on the feature 'horsepower'.

In [163]:
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[['horsepower']])
# fit the transformation on the training data only, then apply it to the test data
x_test_pr = pr.transform(x_test[['horsepower']])
pr
Out[163]:
PolynomialFeatures(degree=5)

Now, let's create a Linear Regression model "poly" and train it.

In [164]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
Out[164]:
LinearRegression()

We can see the output of our model using the method "predict" and assign the predicted values to "yhat".

In [165]:
yhat = poly.predict(x_test_pr)
yhat[0:5]
Out[165]:
array([ 7194.66699567, 10350.09590707, 10991.65504095, 18578.7297168 ,
        3260.96959553])

Let's take the first four predicted values and compare them to the actual targets.

In [166]:
print("Predicted values:", yhat[0:4])
print("True values:", y_test[0:4].values)
Predicted values: [ 7194.66699567 10350.09590707 10991.65504095 18578.7297168 ]
True values: [ 6575.  9988. 15580. 14399.]

We will use the function "PollyPlot" that we defined at the beginning to display the training data, testing data, and the predicted function.

In [167]:
PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly, pr)

We see that the estimated function appears to track the data, but around 200 horsepower it begins to diverge from the data points.

R^2 of the training data:

In [168]:
poly.score(x_train_pr, y_train)
Out[168]:
0.7507740781447492

R^2 of the test data:

In [169]:
poly.score(x_test_pr, y_test)
Out[169]:
-405.19027017426595

A negative R^2 means the model fits the test data worse than simply predicting the mean of the target; it is a sign of overfitting.
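
To see why R^2 can go negative, recall that R^2 = 1 - SSE/SST: when the model's squared errors (SSE) exceed the variance of the target around its mean (SST), the score drops below zero. A minimal sketch with made-up numbers (not from the car data):

from sklearn.metrics import r2_score

# Toy illustration: predictions far worse than always guessing the mean
y_true = [10000, 12000, 14000, 16000]   # made-up target values
y_pred = [30000, 1000, 45000, -5000]    # wildly wrong predictions
print(r2_score(y_true, y_pred))         # prints a large negative R^2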

Let's see how the R^2 on the test data changes for polynomials of different order, and then plot the results:

In [170]:
Rsqu_test = []

order = [1, 2, 3, 4]
for n in order:
    # transform the feature for the current polynomial order
    pr = PolynomialFeatures(degree=n)
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.transform(x_test[['horsepower']])

    # fit on the training data and score on the test data
    lr.fit(x_train_pr, y_train)
    Rsqu_test.append(lr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')
Out[170]:
Text(3, 0.75, 'Maximum R^2 ')
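
The ipywidgets import at the top of this section can be used to explore this trade-off interactively. Here is one possible sketch (an assumption, not one of the recorded cells): wrap the split-transform-fit-plot steps in a function and hand it to "interact", which renders sliders for the polynomial order and the test-set proportion.

# Sketch of an interactive experiment; the function name f and the slider
# ranges are illustrative choices, not part of the original notebook.
def f(order, test_data):
    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data, test_size=test_data, random_state=0)
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.transform(x_test[['horsepower']])
    poly = LinearRegression()
    poly.fit(x_train_pr, y_train)
    PollyPlot(x_train[['horsepower']], x_test[['horsepower']],
              y_train, y_test, poly, pr)

interact(f, order=(0, 6, 1), test_data=(0.05, 0.95, 0.05))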