One example of a Data Model that we will be using is Simple Linear Regression.
Simple Linear Regression is a method to help us understand the relationship between two variables:
the predictor (independent) variable X and the response (dependent) variable Y.
The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.
Y: Response Variable
X: Predictor Variable
Linear function: Yhat = a + bX
where a is the intercept and b is the slope.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm
X = df[['highway-mpg']]
Y = df['price']
#Fit the linear model using highway-mpg:
lm.fit(X,Y)
Yhat=lm.predict(X)
Yhat[0:5]
#What is the value of the intercept (a)?
lm.intercept_
#What is the value of the slope (b)?
lm.coef_
As we saw above, we should get a final linear model with the structure:
Yhat = a + bX
Plugging in the actual values, we get:
Price = 38470.64 - 822.83 x highway-mpg
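As a quick sanity check, here is a minimal sketch (assuming the lm object fitted on "highway-mpg" above is still in scope) that reproduces the first few predictions by plugging the data into the fitted intercept and slope directly:
# Manually apply Yhat = a + bX with the fitted intercept (a) and slope (b),
# then compare against the output of lm.predict(X) above.
Yhat_manual = lm.intercept_ + lm.coef_[0] * df['highway-mpg']
Yhat_manual.values[0:5]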
lm1 = LinearRegression()
lm1
lm1.fit(df[['engine-size']], df[['price']])
lm1
Yhat1 = lm1.predict(df[['engine-size']])
Yhat1[0:5]
#What is the value of the intercept (a)?
lm1.intercept_
#What is the value of the slope (b)?
lm1.coef_
What if we want to predict car price using more than one variable?
If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most real-world regression models involve multiple predictors. We will illustrate the structure using four predictor variables, but these results can generalize to any number of predictors:
Y: Response Variable
X_1: Predictor Variable 1
X_2: Predictor Variable 2
X_3: Predictor Variable 3
X_4: Predictor Variable 4
The equation is given by:
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
From the previous section we know that other good predictors of price could be horsepower, curb-weight, engine-size, and highway-mpg. Let's develop a model using these variables as the predictor variables:
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
#Fit the linear model using the four above-mentioned variables.
lm.fit(Z, df['price'])
lm.intercept_
lm.coef_
As we saw above, we should get a final linear function with the structure:
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
What is the linear function we get in this example?
Price = -15794.35437120974 + 53.5112049 x horsepower + 4.70487452 x curb-weight + 81.53080659 x engine-size + 35.87654175 x highway-mpg
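As a rough check, the sketch below (assuming the lm object fitted on Z above is still in scope) reproduces the same predictions by combining the intercept with the dot product of Z and the coefficient vector:
import numpy as np

# Manually compute Yhat = a + b_1*X_1 + b_2*X_2 + b_3*X_3 + b_4*X_4
Yhat_manual = lm.intercept_ + np.dot(Z.values, lm.coef_)
Yhat_manual[0:5]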
# import the visualization packages: matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
We can see from this plot that price is negatively correlated with highway-mpg, since the regression slope is negative.
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
Comparing the regression plots of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" lie much closer to the fitted line and, on average, decrease as "highway-mpg" increases. The points for "peak-rpm" are more spread out around the predicted line, and it is much harder to determine whether they are increasing or decreasing as "peak-rpm" increases.
#The variable "highway-mpg" has a stronger correlation with "price",
#it is approximate -0.705115 compared to "peak-rpm" which is approximate -0.101593.
df[["peak-rpm","highway-mpg","price"]].corr()
width = 12
height = 10
plt.figure(figsize=(width, height))
# Residual plot of price against highway-mpg
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()
width = 12
height = 10
plt.figure(figsize=(width, height))
# Residual plot of price against peak-rpm
sns.residplot(x=df['peak-rpm'], y=df['price'])
plt.show()
Y_hat = lm.predict(Z)
ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values" , ax=ax1)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap to a good extent. However, there is definitely some room for improvement.
Polynomial regression is a particular case of the general linear regression model (multiple linear regression).
We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.
There are different orders of polynomial regression, for example:
Quadratic (2nd order): Yhat = a + b_1 X + b_2 X^2
Cubic (3rd order): Yhat = a + b_1 X + b_2 X^2 + b_3 X^3
Higher order: Yhat = a + b_1 X + b_2 X^2 + b_3 X^3 + ...
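As a quick illustration of how the order affects the fit, here is a small sketch (assuming df is loaded as above; the degrees chosen are just examples) that fits polynomials of several orders to "highway-mpg" and compares their in-sample R^2:
import numpy as np
from sklearn.metrics import r2_score

x = df['highway-mpg']
y = df['price']

# Fit polynomials of increasing order and compare how well each explains price
for order in (1, 2, 3, 5):
    p_n = np.poly1d(np.polyfit(x, y, order))
    print('order', order, ': R^2 =', r2_score(y, p_n(x)))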
We saw earlier that a linear model did not provide the best fit while using "highway-mpg" as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.
We will use the following function to plot the data:
import numpy as np

def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + Name)
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()
x = df['highway-mpg']
y = df['price']
# Here we use a polynomial of the 3rd order (cubic)
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
PlotPolly(p, x, y, 'highway-mpg')
np.polyfit(x, y, 3)
We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.
# Here we use a polynomial of the 11th order
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1,x,y, 'Highway MPG')
The analytical expression for a multivariate polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_1 X_2 + b_4 X_1^2 + b_5 X_2^2
from sklearn.preprocessing import PolynomialFeatures
pr=PolynomialFeatures(degree=2)
pr
Z_pr=pr.fit_transform(Z)
Z.shape
In the original data, there are 200 samples and 4 features.
Z_pr.shape
After the transformation, there are 200 samples and 15 features.
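To see where the 15 columns come from, here is a small sketch (assuming a scikit-learn version that provides get_feature_names_out; older versions use get_feature_names instead) that lists the generated terms: one bias column, the 4 original features, their 4 squares, and the 6 pairwise interactions:
# List the polynomial terms generated from the four predictors in Z
# (1 bias + 4 linear + 4 squared + 6 interaction terms = 15 columns)
pr.get_feature_names_out(Z.columns)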
#highway_mpg_fit
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
We can say that ~49.718% of the variation of the price is explained by this simple linear model that uses "highway-mpg" as the predictor.
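For intuition about what lm.score reports, this minimal sketch (assuming lm, X, and Y from the cells above) computes R^2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares:
import numpy as np

# R^2 = 1 - SS_res / SS_tot
residuals = Y - lm.predict(X)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
print('Manually computed R-square: ', 1 - ss_res / ss_tot)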
Yhat=lm.predict(X)
print('The output of the first four predicted value is: ', Yhat[0:4])
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)
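The same value can be reproduced without sklearn.metrics, since the MSE is just the average of the squared residuals; a quick sketch using the Yhat computed above:
import numpy as np

# MSE = mean((actual - predicted)^2)
print('Manually computed MSE: ', np.mean((df['price'] - Yhat) ** 2))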
# fit the model
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))
We can say that ~80.938% of the variation of price is explained by this multiple linear regression model fitted on the four predictors in Z.
Y_predict_multifit = lm.predict(Z)
print('The mean square error of price and predicted value using multifit is: ', \
mean_squared_error(df['price'], Y_predict_multifit))
from sklearn.metrics import r2_score
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)
We can say that ~67.427% of the variation of price is explained by this polynomial fit.
mean_squared_error(df['price'], p(x))
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
#Create a new input:
new_input=np.arange(1, 100, 1).reshape(-1, 1)
#Fit the model:
lm.fit(X, Y)
lm
#Produce a prediction:
yhat=lm.predict(new_input)
yhat[0:5]
#We can plot the data:
plt.plot(new_input, yhat)
plt.show()
SLR VS MLR:
This R-squared, in combination with the MSE, shows that MLR seems like the better model fit in this case compared to SLR.
SLR VS POLFIT:
Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting "price" with "highway-mpg" as a predictor variable.
MLR VS POLFIT:
The MSE for the MLR model is smaller than the MSE for the polynomial fit, and the R-squared for the MLR model is also much larger than for the polynomial fit.
We conclude that the MLR model is the best of the three models for predicting price from our dataset.
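To gather these comparisons in one place, here is a sketch that recomputes R-squared and MSE for all three models side by side (it refits the SLR and MLR models under the hypothetical names slr and mlr, since the shared lm object above was refit several times, and it assumes df, X, Y, Z, x, y, and the cubic polynomial p are still in scope):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Refit each model explicitly so the comparison does not depend on
# which fit the shared lm object currently holds
slr = LinearRegression().fit(X, Y)
mlr = LinearRegression().fit(Z, df['price'])

print('SLR : R^2 =', slr.score(X, Y),
      ' MSE =', mean_squared_error(Y, slr.predict(X)))
print('MLR : R^2 =', mlr.score(Z, df['price']),
      ' MSE =', mean_squared_error(df['price'], mlr.predict(Z)))
print('Poly: R^2 =', r2_score(y, p(x)),
      ' MSE =', mean_squared_error(df['price'], p(x)))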