Car Evaluation Sample Project - Part 4

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Fourth Step: Model Development

1. Model 1 --> Simple Linear Regression

One example of a model that we will be using is:

Simple Linear Regression

Simple Linear Regression is a method to help us understand the relationship between two variables:

  • The predictor/independent variable (X)
  • The response/dependent variable that we want to predict (Y)

The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.

Y: response variable
X: predictor variable

Linear function: Yhat=a+bX

  • a refers to the intercept of the regression line, in other words: the value of Y when X is 0
  • b refers to the slope of the regression line, in other words: the amount by which Y changes when X increases by 1 unit

Let's load the modules for linear regression:

In [85]:
from sklearn.linear_model import LinearRegression

Create the linear regression object:

In [86]:
lm = LinearRegression()
lm
Out[86]:
LinearRegression()

We want to look at how highway-mpg can help us predict car price. Using simple linear regression, we will create a linear function with "highway-mpg" as the predictor variable and "price" as the response variable.

In [87]:
X = df[['highway-mpg']]
Y = df['price']
In [88]:
#Fit the linear model using highway-mpg:
lm.fit(X,Y)
Out[88]:
LinearRegression()

We can output a prediction:

In [89]:
Yhat=lm.predict(X)
Yhat[0:5]
Out[89]:
array([16254.26934067, 17077.0977727 , 13785.78404458, 20368.41150083,
       17899.92620473])
In [90]:
#What is the value of the intercept (a)?
lm.intercept_
Out[90]:
38470.63700549667
In [91]:
#What is the value of the slope (b)?
lm.coef_
Out[91]:
array([-822.82843203])

What is the final estimated linear model we get?

As we saw above, we should get a final linear model with the structure:

Yhat=a+bX

Plugging in the actual values we get: Price = 38470.64 - 822.83 x highway-mpg
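We can sanity-check this equation by rebuilding the predictions from the fitted parameters. A minimal sketch, assuming the lm object fitted above (manual_yhat is just an illustrative name):

import numpy as np

# Rebuild the predictions by hand from the intercept and slope
a = lm.intercept_
b = lm.coef_[0]
manual_yhat = a + b * df['highway-mpg']

# Should match lm.predict(X) up to floating-point error
print(np.allclose(manual_yhat, lm.predict(X)))  # True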


We want to look at how engine-size can help us predict car price. Using simple linear regression, we will create a linear function with "engine-size" as the predictor variable and "price" as the response variable.

In [92]:
lm1 = LinearRegression()
lm1
Out[92]:
LinearRegression()
In [93]:
lm1.fit(df[['engine-size']], df[['price']])
lm1
Out[93]:
LinearRegression()
In [94]:
#Predict price from engine-size, the predictor this model was fitted on:
Yhat1=lm1.predict(df[['engine-size']])
Yhat1[0:5]
In [95]:
#What is the value of the intercept (a)?
lm1.intercept_
Out[95]:
array([-7962.44097916])
In [96]:
#What is the value of the slope (b)?
lm1.coef_
Out[96]:
array([[166.8621392]])

Plugging in the actual values we get:

Yhat=-7962.44 + 166.86*X

Price = -7962.44 + 166.86 x engine-size

2. Model 2 --> Multiple Linear Regression

What if we want to predict car price using more than one variable?

If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but it is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most real-world regression models involve multiple predictors. We will illustrate the structure using four predictor variables, but these results generalize to any number of predictors:

Y: response variable
X_1: predictor variable 1
X_2: predictor variable 2
X_3: predictor variable 3
X_4: predictor variable 4

a: intercept
b_1: coefficient of variable 1
b_2: coefficient of variable 2
b_3: coefficient of variable 3
b_4: coefficient of variable 4

The equation is given by:

Yhat=a+b_1X_1+b_2X_2+b_3X_3+b_4X_4

From the previous section we know that other good predictors of price could be:

  • Horsepower
  • Curb-weight
  • Engine-size
  • Highway-mpg

Let's develop a model using these variables as the predictor variables.

In [98]:
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
In [99]:
#Fit the linear model using the four above-mentioned variables.
lm.fit(Z, df['price'])
Out[99]:
LinearRegression()
In [100]:
lm.intercept_
Out[100]:
-15794.35437120974
In [101]:
lm.coef_
Out[101]:
array([53.5112049 ,  4.70487452, 81.53080659, 35.87654175])

What is the final estimated linear model that we get?

As we saw above, we should get a final linear function with the structure:

Yhat=a+b_1X_1+b_2X_2+b_3X_3+b_4X_4

What is the linear function we get in this example?

Price = -15794.35437120974 + 53.5112049 x horsepower + 4.70487452 x curb-weight + 81.53080659 x engine-size + 35.87654175 x highway-mpg
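As with the simple model, each prediction is just the intercept plus a weighted sum of the four predictor columns. A minimal sketch verifying this, assuming the lm object fitted on Z above and a pandas version with .to_numpy():

import numpy as np

# Intercept plus the dot product of each row of Z with the coefficients
a = lm.intercept_
b = lm.coef_
manual_yhat = a + Z.to_numpy() @ b

print(np.allclose(manual_yhat, lm.predict(Z)))  # True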

Model 1 Evaluation Using Visualization

In [102]:
# import the visualization package seaborn, along with matplotlib and
# numpy, which the following cells rely on
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Let's visualize highway-mpg as a potential predictor variable of price:

In [103]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
Out[103]:
(0.0, 48174.175827199426)

We can see from this plot that price is negatively correlated to highway-mpg since the regression slope is negative.

In [104]:
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
Out[104]:
(0.0, 47414.1)

Comparing the regression plots of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" lie much closer to the fitted line and, on average, decrease. The points for "peak-rpm" are more spread around the fitted line, and it is much harder to determine whether the points are increasing or decreasing as "peak-rpm" increases.

In [105]:
#The variable "highway-mpg" has a stronger correlation with "price":
#approximately -0.705115, compared to "peak-rpm" at approximately -0.101593.
df[["peak-rpm","highway-mpg","price"]].corr()
Out[105]:
             peak-rpm  highway-mpg     price
peak-rpm     1.000000    -0.059319 -0.101593
highway-mpg -0.059319     1.000000 -0.705115
price       -0.101593    -0.705115  1.000000
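If scipy is available, pearsonr reports the same correlation coefficient together with a p-value; a quick sketch:

from scipy import stats

# Same Pearson correlation as in the table above, plus a significance test
coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print(coef, p_value)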

Residual Plot for "highway-mpg"

In [108]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()
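Under the hood, residplot fits a regression and scatters the leftover error against the predictor. Roughly the following sketch, where slr is a throwaway model object so the lm fitted on Z above is left untouched:

# Residuals: actual price minus the price predicted from highway-mpg
slr = LinearRegression().fit(X, Y)
residuals = Y - slr.predict(X)

plt.scatter(df['highway-mpg'], residuals, s=10)
plt.axhline(0, color='red')  # a good fit leaves residuals centered on zero
plt.xlabel('highway-mpg')
plt.ylabel('Residuals')
plt.show()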

Residual Plot for "peak-rpm"

In [109]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['peak-rpm'], y=df['price'])
plt.show()

Model 2 Evaluation Using Visualization

In [110]:
Y_hat = lm.predict(Z)
In [111]:
ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

We can see that the fitted values are reasonably close to the actual values since the two distributions overlap a bit. However, there is definitely some room for improvement.
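Note that distplot has been removed from recent seaborn releases. If it is unavailable, a kdeplot-based sketch produces the same density comparison (a hist=False distplot is essentially a KDE curve):

ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
sns.kdeplot(Y_hat, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.legend()
plt.show()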

3. Model 3 --> Polynomial Regression

Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.

We capture non-linear relationships by including squared or higher-order terms of the predictor variables.

There are different orders of polynomial regression:

Quadratic - 2nd Order
Yhat=a+b_1X+b_2X^2

Cubic - 3rd Order
Yhat=a+b_1X+b_2X^2+b_3X^3

Higher-Order:
Y=a+b_1X+b_2X^2+b_3X^3+...

We saw earlier that a linear model did not provide the best fit while using "highway-mpg" as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.

We will use the following function to plot the data:

In [112]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on a smooth grid of mpg values
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    # Plot the data points together with the fitted curve
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

Let's get the variables:

In [113]:
x = df['highway-mpg']
y = df['price']

Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.

In [114]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
        3         2
-1.552 x + 204.2 x - 8948 x + 1.378e+05

Let's plot the function:

In [115]:
PlotPolly(p, x, y, 'highway-mpg')
In [116]:
np.polyfit(x, y, 3)
Out[116]:
array([-1.55173297e+00,  2.04232144e+02, -8.94817574e+03,  1.37751367e+05])
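Since p is a callable poly1d object, evaluating it applies the cubic directly. A quick sketch at a sample mpg value of 30 (chosen arbitrarily):

# Two equivalent ways to evaluate the fitted cubic at highway-mpg = 30
print(p(30))              # call the poly1d object
print(np.polyval(f, 30))  # same result from the raw coefficients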

We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.

In [117]:
# Here we use a polynomial of the 11th order
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1,x,y, 'Highway MPG')
            11             10             9           8         7
-1.273e-08 x  + 4.839e-06 x  - 0.0008229 x + 0.08259 x - 5.432 x
          6        5             4             3            2
 + 245.6 x - 7786 x + 1.729e+05 x - 2.634e+06 x + 2.62e+07 x - 1.532e+08 x + 3.987e+08

The analytical expression for a multivariate polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:

Yhat=a+b_1X_1+b_2X_2+b_3X_1X_2+b_4X_1^2+b_5X_2^2
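PolynomialFeatures (used below) generates exactly these terms. A self-contained sketch with one two-feature sample, just to show the expansion order:

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Degree-2 expansion of one sample [X_1=2, X_2=3]:
# 1, X_1, X_2, X_1^2, X_1*X_2, X_2^2
demo = PolynomialFeatures(degree=2)
print(demo.fit_transform(np.array([[2.0, 3.0]])))
# [[1. 2. 3. 4. 6. 9.]]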

We can perform a polynomial transform on multiple features. First, we import the module:

In [118]:
from sklearn.preprocessing import PolynomialFeatures

We create a PolynomialFeatures object of degree 2:

In [119]:
pr=PolynomialFeatures(degree=2)
pr
Out[119]:
PolynomialFeatures()
In [120]:
Z_pr=pr.fit_transform(Z)
In [121]:
Z.shape
Out[121]:
(200, 4)

In the original data, there are 200 samples and 4 features.

In [122]:
Z_pr.shape
Out[122]:
(200, 15)

After the transformation, there are 200 samples and 15 features.
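The count follows from the degree-2 expansion: a bias term, the 4 original features, their 4 squares, and the 6 pairwise products, giving 1 + 4 + 4 + 6 = 15. With scikit-learn 1.0 or newer, the generated terms can be listed; a sketch:

# List the 15 generated feature names (scikit-learn >= 1.0)
print(pr.get_feature_names_out(Z.columns))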

Measures for In-Sample Evaluation

Model 1: Simple Linear Regression

Let's calculate the R^2:

In [123]:
#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
The R-square is:  0.49718675257265266

We can say that ~49.718% of the variation in price is explained by this simple linear model "highway_mpg_fit".
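The score is one minus the ratio of the residual sum of squares to the total sum of squares; a minimal sketch verifying it by hand (sse and sst are just illustrative names):

import numpy as np

# R^2 = 1 - SSE/SST
sse = np.sum((Y - lm.predict(X)) ** 2)
sst = np.sum((Y - Y.mean()) ** 2)
print(1 - sse / sst)  # matches lm.score(X, Y) above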

Let's calculate the MSE:

In [124]:
Yhat=lm.predict(X)
print('The first four predicted values are: ', Yhat[0:4])
The first four predicted values are:  [16254.26934067 17077.0977727  13785.78404458 20368.41150083]
In [125]:
from sklearn.metrics import mean_squared_error

We can compare the predicted results with the actual results:

In [126]:
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)
The mean square error of price and predicted value is:  31755395.41081296
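mean_squared_error simply averages the squared residuals, so the same number can be reproduced by hand:

# MSE = mean of the squared residuals
print(((df['price'] - Yhat) ** 2).mean())  # matches the value above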

Model 2: Multiple Linear Regression

Let's calculate the R^2:

In [127]:
# fit the model 
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))
The R-square is:  0.8093753249041752

We can say that ~80.938% of the variation in price is explained by this multiple linear regression "multi_fit".

Let's calculate the MSE:

In [128]:
Y_predict_multifit = lm.predict(Z)

We can compare the predicted results with the actual results:

In [129]:
print('The mean square error of price and predicted value using multifit is: ', \
      mean_squared_error(df['price'], Y_predict_multifit))
The mean square error of price and predicted value using multifit is:  12038986.569462514

Model 3: Polynomial Fit

Let's calculate the R^2:

In [130]:
from sklearn.metrics import r2_score
In [131]:
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)
The R-square value is:  0.6742706265540409

We can say that ~67.427% of the variation in price is explained by this polynomial fit.

Let's calculate the MSE:

In [132]:
mean_squared_error(df['price'], p(x))
Out[132]:
20571584.18879441

Prediction and Decision Making

In [133]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
In [134]:
#Create a new input:
new_input=np.arange(1, 100, 1).reshape(-1, 1)
In [135]:
#Fit the model:
lm.fit(X, Y)
lm
Out[135]:
LinearRegression()
In [136]:
#Produce a prediction:
yhat=lm.predict(new_input)
yhat[0:5]
Out[136]:
array([37647.80857347, 36824.98014144, 36002.15170941, 35179.32327737,
       34356.49484534])
In [137]:
#We can plot the predictions:
plt.plot(new_input, yhat)
plt.show()

Decision Making: Determining a Good Model Fit

  1. The model with the highest R-squared value is a better fit for the data.
  2. The model with the smallest MSE value is a better fit for the data.
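To ground these criteria, a small sketch collecting the in-sample numbers reported above into one table (summary is just an illustrative name):

import pandas as pd

# Metrics reported in the sections above (rounded)
summary = pd.DataFrame({
    'R-squared': [0.497187, 0.809375, 0.674271],
    'MSE': [31755395.41, 12038986.57, 20571584.19],
}, index=['SLR (highway-mpg)', 'MLR (4 predictors)', 'Polynomial (cubic)'])
print(summary)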

SLR VS MLR:
The MLR model has both a higher R-squared (~0.809 vs. ~0.497) and a smaller MSE (~1.20 x 10^7 vs. ~3.18 x 10^7), so MLR is the better fit compared to SLR in this case.

SLR VS POLFIT:
Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that it was a better model than the simple linear regression for predicting "price" with "highway-mpg" as the predictor variable.

MLR VS POLFIT:
The MSE for the MLR is smaller than the MSE for the Polynomial Fit, and the R-squared for the MLR is also much higher, so the MLR model outperforms the polynomial fit as well.

Conclusion:

We conclude that the MLR model is the best of the three models for predicting price from our dataset.