Car Evaluation Sample Project - Part 3

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Third Step: Exploratory Analysis

Analyzing Individual Feature Patterns Using Visualization

In [52]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [53]:
# list the data types for each column
print(df.dtypes)
symboling               int64
normalized-losses       int32
make                   object
num-of-doors           object
body-style             object
drive-wheels           object
engine-location        object
wheel-base            float64
length                float64
width                 float64
height                float64
curb-weight             int64
engine-type            object
num-of-cylinders       object
engine-size             int64
fuel-system            object
bore                  float64
stroke                float64
compression-ratio     float64
horsepower              int32
peak-rpm              float64
city-mpg                int64
highway-mpg             int64
price                 float64
city-L/100km          float64
horsepower-binned    category
fuel-type-diesel        uint8
fuel-type-gas           uint8
aspiration-std          uint8
aspiration-turbo        uint8
dtype: object

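Calculate the correlation between all numeric variables. This covers the "int64" and "float64" columns as well as the "int32" columns and the "uint8" dummy variables; columns of type "object" or "category" are left out.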

In [54]:
df.corr()
Out[54]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke ... horsepower peak-rpm city-mpg highway-mpg price city-L/100km fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
symboling 1.000000 0.469772 -0.529145 -0.364511 -0.237262 -0.542261 -0.234743 -0.112069 -0.145790 0.008072 ... 0.074617 0.284033 -0.030158 0.041248 -0.083327 0.062423 -0.195381 0.195381 0.050995 -0.050995
normalized-losses 0.469772 1.000000 -0.057068 0.019433 0.086961 -0.377664 0.099404 0.112362 -0.029929 0.055673 ... 0.217322 0.239580 -0.225255 -0.182011 0.133999 0.238712 -0.101574 0.101574 0.006915 -0.006915
wheel-base -0.529145 -0.057068 1.000000 0.879005 0.814593 0.583789 0.787584 0.576779 0.501576 0.144733 ... 0.375610 -0.365045 -0.480029 -0.552211 0.589147 0.484047 0.306706 -0.306706 -0.254907 0.254907
length -0.364511 0.019433 0.879005 1.000000 0.857271 0.492955 0.881058 0.685531 0.610847 0.120929 ... 0.580583 -0.286688 -0.667658 -0.700186 0.691044 0.659174 0.210616 -0.210616 -0.229294 0.229294
width -0.237262 0.086961 0.814593 0.857271 1.000000 0.300995 0.867720 0.731100 0.548485 0.182864 ... 0.617115 -0.247337 -0.638155 -0.684700 0.752795 0.677111 0.243448 -0.243448 -0.304459 0.304459
height -0.542261 -0.377664 0.583789 0.492955 0.300995 1.000000 0.310660 0.076255 0.187917 -0.081103 ... -0.085829 -0.315820 -0.057087 -0.111568 0.137284 0.008923 0.281198 -0.281198 -0.086531 0.086531
curb-weight -0.234743 0.099404 0.787584 0.881058 0.867720 0.310660 1.000000 0.849090 0.644552 0.168669 ... 0.758076 -0.279422 -0.750390 -0.795515 0.834420 0.785868 0.221082 -0.221082 -0.322097 0.322097
engine-size -0.112069 0.112362 0.576779 0.685531 0.731100 0.076255 0.849090 1.000000 0.572878 0.208133 ... 0.822689 -0.256681 -0.651002 -0.679877 0.872337 0.745337 0.070925 -0.070925 -0.110278 0.110278
bore -0.145790 -0.029929 0.501576 0.610847 0.548485 0.187917 0.644552 0.572878 1.000000 -0.051087 ... 0.566807 -0.267061 -0.581272 -0.590672 0.543433 0.553954 0.055395 -0.055395 -0.229338 0.229338
stroke 0.008072 0.055673 0.144733 0.120929 0.182864 -0.081103 0.168669 0.208133 -0.051087 1.000000 ... 0.100881 -0.066021 -0.040547 -0.040170 0.083298 0.041310 0.240046 -0.240046 -0.215805 0.215805
compression-ratio -0.181073 -0.114738 0.249689 0.159203 0.189008 0.259526 0.156444 0.029005 0.002034 0.186780 ... -0.214260 -0.436303 0.330897 0.267929 0.071176 -0.298898 0.985228 -0.985228 -0.307074 0.307074
horsepower 0.074617 0.217322 0.375610 0.580583 0.617115 -0.085829 0.758076 0.822689 0.566807 0.100881 ... 1.000000 0.108163 -0.822488 -0.804702 0.809657 0.889613 -0.168755 0.168755 -0.251799 0.251799
peak-rpm 0.284033 0.239580 -0.365045 -0.286688 -0.247337 -0.315820 -0.279422 -0.256681 -0.267061 -0.066021 ... 0.108163 1.000000 -0.116364 -0.059319 -0.101593 0.116528 -0.476430 0.476430 0.190772 -0.190772
city-mpg -0.030158 -0.225255 -0.480029 -0.667658 -0.638155 -0.057087 -0.750390 -0.651002 -0.581272 -0.040547 ... -0.822488 -0.116364 1.000000 0.972024 -0.687186 -0.949692 0.264947 -0.264947 0.191068 -0.191068
highway-mpg 0.041248 -0.182011 -0.552211 -0.700186 -0.684700 -0.111568 -0.795515 -0.679877 -0.590672 -0.040170 ... -0.804702 -0.059319 0.972024 1.000000 -0.705115 -0.929940 0.197989 -0.197989 0.243429 -0.243429
price -0.083327 0.133999 0.589147 0.691044 0.752795 0.137284 0.834420 0.872337 0.543433 0.083298 ... 0.809657 -0.101593 -0.687186 -0.705115 1.000000 0.790291 0.110417 -0.110417 -0.179762 0.179762
city-L/100km 0.062423 0.238712 0.484047 0.659174 0.677111 0.008923 0.785868 0.745337 0.553954 0.041310 ... 0.889613 0.116528 -0.949692 -0.929940 0.790291 1.000000 -0.240676 0.240676 -0.158912 0.158912
fuel-type-diesel -0.195381 -0.101574 0.306706 0.210616 0.243448 0.281198 0.221082 0.070925 0.055395 0.240046 ... -0.168755 -0.476430 0.264947 0.197989 0.110417 -0.240676 1.000000 -1.000000 -0.407787 0.407787
fuel-type-gas 0.195381 0.101574 -0.306706 -0.210616 -0.243448 -0.281198 -0.221082 -0.070925 -0.055395 -0.240046 ... 0.168755 0.476430 -0.264947 -0.197989 -0.110417 0.240676 -1.000000 1.000000 0.407787 -0.407787
aspiration-std 0.050995 0.006915 -0.254907 -0.229294 -0.304459 -0.086531 -0.322097 -0.110278 -0.229338 -0.215805 ... -0.251799 0.190772 0.191068 0.243429 -0.179762 -0.158912 -0.407787 0.407787 1.000000 -1.000000
aspiration-turbo -0.050995 -0.006915 0.254907 0.229294 0.304459 0.086531 0.322097 0.110278 0.229338 0.215805 ... 0.251799 -0.190772 -0.191068 -0.243429 0.179762 0.158912 0.407787 -0.407787 -1.000000 1.000000

21 rows × 21 columns
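
The full matrix is hard to scan when only the relationship with price matters. The sketch below shows one way to restrict the calculation to numeric columns and rank them by their correlation with price. The frame `toy` and its values are made up for illustration and are not the project data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
toy = pd.DataFrame({
    "make": ["a", "b", "c", "d", "e"],   # object column, excluded below
    "engine-size": [90, 110, 130, 150, 200],
    "highway-mpg": [40, 35, 30, 27, 20],
    "price": [6000.0, 8000.0, 11000.0, 15000.0, 30000.0],
})

# Select numeric columns explicitly so the result does not depend on how
# the installed pandas version treats object columns in corr().
numeric = toy.select_dtypes(include="number")

# Correlation of each numeric column with price, largest value first.
price_corr = numeric.corr()["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Selecting the numeric columns first keeps the call explicit about which variables enter the correlation.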

In [55]:
df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()
Out[55]:
bore stroke compression-ratio horsepower
bore 1.000000 -0.051087 0.002034 0.566807
stroke -0.051087 1.000000 0.186780 0.100881
compression-ratio 0.002034 0.186780 1.000000 -0.214260
horsepower 0.566807 0.100881 -0.214260 1.000000

Let's draw a scatterplot of "engine-size" against "price" with a fitted regression line.

In [56]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
Out[56]:
(0.0, 53304.355388382566)

We can examine the correlation between 'engine-size' and 'price' and see that it's approximately 0.87.

In [57]:
df[["engine-size", "price"]].corr()
Out[57]:
engine-size price
engine-size 1.000000 0.872337
price 0.872337 1.000000

Highway mpg is a potential predictor variable of price. Let's draw the scatterplot of "highway-mpg" and "price".

In [58]:
sns.regplot(x="highway-mpg", y="price", data=df)
Out[58]:
<AxesSubplot:xlabel='highway-mpg', ylabel='price'>
In [59]:
df[['highway-mpg', 'price']].corr()
Out[59]:
highway-mpg price
highway-mpg 1.000000 -0.705115
price -0.705115 1.000000

Let's see if "peak-rpm" is a predictor variable of "price".

In [60]:
sns.regplot(x="peak-rpm", y="price", data=df)
Out[60]:
<AxesSubplot:xlabel='peak-rpm', ylabel='price'>

Peak rpm does not seem to be a good predictor of price, since the regression line is close to horizontal.

In [61]:
df[['peak-rpm','price']].corr()
Out[61]:
peak-rpm price
peak-rpm 1.000000 -0.101593
price -0.101593 1.000000
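
A near-horizontal regression line and a correlation close to zero are two views of the same fact: for a simple least-squares line, slope = r * std(y) / std(x). The sketch below checks this identity on synthetic data; the arrays `x` and `y` are randomly generated stand-ins and not the real peak-rpm and price columns.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5000, 500, size=200)     # made-up stand-in for peak-rpm
y = rng.normal(13000, 8000, size=200)   # made-up stand-in for price, independent of x

# Pearson correlation and the least-squares line y = slope * x + intercept.
r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

# Slope recovered from the correlation coefficient alone.
slope_from_r = r * y.std() / x.std()
print(r, slope, slope_from_r)
```
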
In [62]:
df[["stroke","price"]].corr()
Out[62]:
stroke price
stroke 1.000000 0.083298
price 0.083298 1.000000
In [63]:
sns.regplot(x="stroke", y="price", data=df)
Out[63]:
<AxesSubplot:xlabel='stroke', ylabel='price'>

There is only a weak correlation (about 0.083) between the variables 'stroke' and 'price'.

Categorical Variables

In [64]:
#check body-style
sns.boxplot(x="body-style", y="price", data=df)
Out[64]:
<AxesSubplot:xlabel='body-style', ylabel='price'>

We see that the price distributions of the different body-style categories overlap significantly, so body-style would not be a good predictor of price.

In [65]:
sns.boxplot(x="engine-location", y="price", data=df)
Out[65]:
<AxesSubplot:xlabel='engine-location', ylabel='price'>

Here we see that the price distributions of the two engine-location categories, front and rear, are distinct enough to consider engine-location a potentially good predictor of price.

In [67]:
# check drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
Out[67]:
<AxesSubplot:xlabel='drive-wheels', ylabel='price'>

Here we see that the distribution of price between the different drive-wheels categories differs. As such, drive-wheels could potentially be a predictor of price.
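
The overlap that a boxplot shows can also be read off as numbers by comparing quartiles per category. A minimal sketch, assuming a small made-up frame `toy` (the prices are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical prices per drive-wheels category.
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "4wd", "4wd"],
    "price": [7000.0, 8000.0, 9500.0, 16000.0, 20000.0, 31000.0, 9000.0, 11000.0],
})

# Count and quartiles of price within each category: the same numbers
# a boxplot draws as the box edges and the median line.
summary = toy.groupby("drive-wheels")["price"].describe()[["count", "25%", "50%", "75%"]]
print(summary)
```

If the interquartile ranges of two categories barely overlap, as for fwd and rwd in this toy example, the category carries information about price.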

Descriptive Statistical Analysis

In [68]:
df.describe()
Out[68]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke ... horsepower peak-rpm city-mpg highway-mpg price city-L/100km fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
count 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 ... 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000 200.000000
mean 0.830000 122.000000 98.848000 0.837232 0.915250 0.899523 2555.705000 126.860000 3.329981 3.259816 ... 103.365000 5118.259901 25.200000 30.705000 13205.690000 9.937914 0.100000 0.900000 0.820000 0.180000
std 1.248557 32.076542 6.038261 0.059333 0.029207 0.040610 518.594552 41.650501 0.268562 0.314177 ... 37.455525 479.240743 6.432487 6.827227 7966.982558 2.539415 0.300753 0.300753 0.385152 0.385152
min -2.000000 65.000000 86.600000 0.678039 0.837500 0.799331 1488.000000 61.000000 2.540000 2.070000 ... 48.000000 4150.000000 13.000000 16.000000 5118.000000 4.795918 0.000000 0.000000 0.000000 0.000000
25% 0.000000 100.250000 94.500000 0.800937 0.891319 0.869565 2163.000000 97.750000 3.150000 3.117500 ... 70.000000 4800.000000 19.000000 25.000000 7775.000000 7.833333 0.000000 1.000000 1.000000 0.000000
50% 1.000000 122.000000 97.000000 0.832292 0.909722 0.904682 2414.000000 119.500000 3.310000 3.290000 ... 95.000000 5162.995050 24.000000 30.000000 10270.000000 9.791667 0.000000 1.000000 1.000000 0.000000
75% 2.000000 138.250000 102.400000 0.881788 0.926042 0.928512 2928.250000 142.000000 3.582500 3.410000 ... 116.000000 5500.000000 30.000000 34.000000 16500.750000 12.368421 0.000000 1.000000 1.000000 0.000000
max 3.000000 256.000000 120.900000 1.000000 1.000000 1.000000 4066.000000 326.000000 3.940000 4.170000 ... 262.000000 6600.000000 49.000000 54.000000 45400.000000 18.076923 1.000000 1.000000 1.000000 1.000000

8 rows × 21 columns

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:

In [69]:
df.describe(include=['object'])
Out[69]:
make num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system
count 200 200 200 200 200 200 200 200
unique 22 2 5 3 2 6 7 8
top toyota four sedan fwd front ohc four mpfi
freq 32 115 94 118 197 145 156 91

Value Counts

In [70]:
df['drive-wheels'].value_counts()
Out[70]:
fwd    118
rwd     74
4wd      8
Name: drive-wheels, dtype: int64

We can convert the series to a dataframe as follows:

In [71]:
df['drive-wheels'].value_counts().to_frame()
Out[71]:
drive-wheels
fwd 118
rwd 74
4wd 8

Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the column 'drive-wheels' to 'value_counts':

In [72]:
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
Out[72]:
value_counts
fwd 118
rwd 74
4wd 8

Now let's rename the index to 'drive-wheels':

In [73]:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
Out[73]:
value_counts
drive-wheels
fwd 118
rwd 74
4wd 8
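
The rename step above relies on the counts column being named after the source column. In pandas 2.0 and later the counts series is named `count` instead, so the rename may silently do nothing. The sketch below names the index and the column explicitly, which should behave the same across versions. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical column of drive-wheels values.
toy = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

counts = (
    toy["drive-wheels"]
    .value_counts()
    .rename_axis("drive-wheels")   # name the index explicitly
    .to_frame("value_counts")      # name the single column explicitly
)
print(counts)
```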

We can repeat the above process for the variable 'engine-location'.

In [74]:
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
Out[74]:
value_counts
engine-location
front 197
rear 3

After examining the value counts of the engine location, we see that engine location would not be a good predictor variable for the price. We have only three cars with a rear engine and 197 with an engine in the front, so the sample is heavily skewed. Thus, we are not able to draw any conclusions about the engine location.
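
The imbalance is easier to judge as a proportion. A minimal sketch that rebuilds a series from the counts in the table above (197 front, 3 rear) and normalizes them; the series name `engine_location` is only illustrative:

```python
import pandas as pd

# Series reconstructed from the counts shown above.
engine_location = pd.Series(["front"] * 197 + ["rear"] * 3, name="engine-location")

# normalize=True returns shares instead of absolute counts.
shares = engine_location.value_counts(normalize=True)
print(shares)
```

Rear-engine cars make up 1.5% of the rows, too few to support a conclusion.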

Correlation and Causation

  • p-value < 0.001: there is strong evidence that the correlation is significant.
  • p-value < 0.05: there is moderate evidence that the correlation is significant.
  • p-value < 0.1: there is weak evidence that the correlation is significant.
  • p-value > 0.1: there is no evidence that the correlation is significant.
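
The thresholds above can be written as a small helper. This is only a sketch: the function name `evidence_label` is hypothetical, and treating a p-value of exactly 0.1 as "none" is an assumption, since the list does not say how to handle the boundary.

```python
def evidence_label(p_value: float) -> str:
    """Map a p-value to the evidence categories listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    # Assumption: p == 0.1 is grouped with p > 0.1.
    return "none"


print(evidence_label(4.46e-20))
print(evidence_label(0.03))
```
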
In [75]:
from scipy import stats

Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

In [76]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
The Pearson Correlation Coefficient is 0.5891470005448702  with a P-value of P = 4.4570195020504053e-20

Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.589).

Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

In [77]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is 0.8096565575365611  with a P-value of P =  1.0468839625927006e-47

Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.810, close to 1).

Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.

In [78]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is 0.6910440897821903  with a P-value of P =  9.96096322234889e-30

Since the p-value is < 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':

In [79]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )
The Pearson Correlation Coefficient is 0.7527948631832608  with a P-value of P = 8.256714148309272e-38

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.753).

Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':

In [80]:
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is 0.8344204348498461  with a P-value of P =  3.9699775360220333e-53

Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':

In [81]:
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)
The Pearson Correlation Coefficient is 0.8723367498521142  with a P-value of P = 1.8977171466563487e-63

Since the p-value is < 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price':

In [82]:
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value )
The Pearson Correlation Coefficient is 0.5434325935555682  with a P-value of P =   9.207487524195266e-17

Since the p-value is < 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.543).

Let's calculate the Pearson Correlation Coefficient and P-value of 'city-mpg' and 'price':

In [83]:
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
The Pearson Correlation Coefficient is -0.6871861020862686  with a P-value of P =  2.729256568479228e-29

Since the p-value is < 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.

Let's calculate the Pearson Correlation Coefficient and P-value of 'highway-mpg' and 'price':

In [84]:
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )
The Pearson Correlation Coefficient is -0.7051147088046402  with a P-value of P =  2.197326053158553e-31

Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of about -0.705 shows that the relationship is negative and moderately strong.
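
The cell-by-cell calculations above all follow the same pattern, so they can be collected in a loop. The sketch below does this on synthetic data: the frame `toy` is randomly generated (an "engine-size" column that drives price and a "stroke" column that does not), so its numbers are not the results reported above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data, not the project dataframe.
rng = np.random.default_rng(42)
n = 200
engine_size = rng.normal(127, 40, size=n)
toy = pd.DataFrame({
    "engine-size": engine_size,
    "price": 100 * engine_size + rng.normal(0, 2000, size=n),
    "stroke": rng.normal(3.26, 0.3, size=n),
})

# Pearson coefficient and p-value of each candidate feature against price.
rows = []
for col in ["engine-size", "stroke"]:
    coef, p = stats.pearsonr(toy[col], toy["price"])
    rows.append({"feature": col, "pearson_r": coef, "p_value": p})

results = pd.DataFrame(rows).set_index("feature")

# Order features by absolute correlation, strongest first.
order = results["pearson_r"].abs().sort_values(ascending=False).index
results = results.reindex(order)
print(results)
```

Sorting by absolute value keeps strongly negative predictors such as the mpg variables near the top of the table.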

Conclusion:

Important continuous numerical variables:

Length
Width
Curb-weight
Engine-size
Horsepower
City-mpg
Highway-mpg
Wheel-base
Bore

Important categorical variables:

Drive-wheels