Car Evaluation Sample Project - Part 3

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Third Step: Exploratory Analysis

Analyzing Individual Feature Patterns Using Visualization

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# list the data types for each column
print(df.dtypes)

symboling               int64
normalized-losses       int32
make                   object
num-of-doors           object
body-style             object
drive-wheels           object
engine-location        object
wheel-base            float64
length                float64
width                 float64
height                float64
curb-weight             int64
engine-type            object
num-of-cylinders       object
engine-size             int64
fuel-system            object
bore                  float64
stroke                float64
compression-ratio     float64
horsepower              int32
peak-rpm              float64
city-mpg                int64
highway-mpg             int64
price                 float64
city-L/100km          float64
horsepower-binned    category
fuel-type-diesel        uint8
fuel-type-gas           uint8
aspiration-std          uint8
aspiration-turbo        uint8
dtype: object

Calculate the correlation between variables of type "int64" or "float64"

df.corr()

df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()
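Depending on the pandas version, calling `corr()` on a frame that still contains object columns may either skip them silently or raise an error. One way to make the intent explicit is to select the numeric columns first. The sketch below uses a small made-up frame; the column names mirror the dataset but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 183],
    "price": [7000.0, 9500.0, 13500.0, 17000.0, 24000.0],
    "make": ["a", "b", "c", "d", "e"],  # object column, not usable in corr()
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()
print(corr_matrix)
```

The same pattern should apply to the full `df`, assuming its categorical columns are stored as object or category dtypes.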

Let's draw the scatterplot of "engine-size" and "price".

# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

(0.0, 53304.355388382566)

We can examine the correlation between 'engine-size' and 'price' and see that it's approximately 0.87.

df[["engine-size", "price"]].corr()

Highway mpg is a potential predictor variable of price. Let's draw the scatterplot of "highway-mpg" and "price".

sns.regplot(x="highway-mpg", y="price", data=df)

<AxesSubplot:xlabel='highway-mpg', ylabel='price'>

df[['highway-mpg', 'price']].corr()

Let's see if "peak-rpm" is a predictor variable of "price".

sns.regplot(x="peak-rpm", y="price", data=df)

<AxesSubplot:xlabel='peak-rpm', ylabel='price'>

Peak rpm does not seem to be a good predictor of price, since the regression line is close to horizontal (the correlation with price is about -0.10).
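A nearly flat regression line and a correlation near zero are two views of the same thing: for a simple least-squares fit, the slope multiplied by std(x) / std(y) equals the Pearson coefficient. The sketch below checks this with NumPy on made-up numbers (the values are hypothetical and not taken from the dataset).

```python
import numpy as np

# Hypothetical values for illustration only (not taken from the dataset).
x = np.array([4800.0, 5000.0, 5200.0, 5400.0, 5600.0, 5800.0])
y = np.array([12000.0, 9000.0, 15000.0, 8000.0, 14000.0, 11000.0])

# Least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Rescaling the slope by the ratio of standard deviations recovers r.
standardized_slope = slope * x.std() / y.std()
print(r, standardized_slope)
```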

df[['peak-rpm','price']].corr()

df[["stroke","price"]].corr()

sns.regplot(x="stroke", y="price", data=df)

<AxesSubplot:xlabel='stroke', ylabel='price'>

There is only a very weak correlation between 'stroke' and 'price' (about 0.08).

Categorical Variables

# check body-style
sns.boxplot(x="body-style", y="price", data=df)

<AxesSubplot:xlabel='body-style', ylabel='price'>

We see that the price distributions of the different body-style categories overlap significantly, so body-style would not be a good predictor of price.

sns.boxplot(x="engine-location", y="price", data=df)

<AxesSubplot:xlabel='engine-location', ylabel='price'>

Here we see that the price distributions of the two engine-location categories, front and rear, are distinct enough to consider engine-location a potentially good predictor of price.

# check drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)

<AxesSubplot:xlabel='drive-wheels', ylabel='price'>

Here we see that the distribution of price between the different drive-wheels categories differs. As such, drive-wheels could potentially be a predictor of price.
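The visual impression from a boxplot can be backed up with numbers by computing the quartiles per category. The sketch below does this on a small made-up frame; the prices are hypothetical, and the overlap check on the interquartile boxes is only one rough criterion among many.

```python
import pandas as pd

# Hypothetical toy data: the category labels mirror the dataset, prices are made up.
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "fwd", "fwd", "rwd", "rwd", "rwd", "rwd"],
    "price": [6000.0, 7500.0, 8000.0, 9500.0, 15000.0, 18000.0, 21000.0, 30000.0],
})

# Quartiles of price within each drive-wheels category (the edges of each box).
quartiles = (
    toy.groupby("drive-wheels")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
print(quartiles)

# Two boxes overlap if each group's q1 lies below the other group's q3.
fwd, rwd = quartiles.loc["fwd"], quartiles.loc["rwd"]
boxes_overlap = bool(fwd["q1"] <= rwd["q3"] and rwd["q1"] <= fwd["q3"])
print("IQR boxes overlap:", boxes_overlap)
```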

Descriptive Statistical Analysis

df.describe()

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:

df.describe(include=['object'])

Value Counts

df['drive-wheels'].value_counts()

fwd    118
rwd     74
4wd      8
Name: drive-wheels, dtype: int64

We can convert the series to a dataframe as follows:

df['drive-wheels'].value_counts().to_frame()

Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the column 'drive-wheels' to 'value_counts':

drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts

Now let's rename the index to 'drive-wheels':

drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
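The name of the column produced by `value_counts().to_frame()` has differed between pandas versions (newer releases call it 'count'), in which case renaming from 'drive-wheels' would silently do nothing. A version-independent alternative is to set both names explicitly. The sketch below uses a short made-up series, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for df['drive-wheels'].
drive_wheels = pd.Series(
    ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"], name="drive-wheels"
)

# Name the index and the count column explicitly, whatever the pandas defaults are.
counts = (
    drive_wheels.value_counts()
    .rename_axis("drive-wheels")
    .to_frame("value_counts")
)
print(counts)
```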

We can repeat the above process for the variable 'engine-location'.

# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)

After examining the value counts of the engine location, we see that engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 197 with an engine in the front, so the result is skewed. Thus, we are not able to draw any conclusions about the engine location.
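The imbalance is easier to see as proportions than as raw counts. The sketch below rebuilds the column from the frequencies reported by `describe(include=['object'])` (200 rows, 197 of them 'front'); the variable names are illustrative.

```python
import pandas as pd

# Rebuilt from the summary frequencies: 197 front-engine cars out of 200.
engine_location = pd.Series(
    ["front"] * 197 + ["rear"] * 3, name="engine-location"
)

# normalize=True returns the share of each category instead of the count.
shares = engine_location.value_counts(normalize=True)
print(shares)
```

With only 1.5% of the rows in the 'rear' category, any price difference between the two groups rests on very few observations.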

Correlation and Causation

p-value < 0.001: there is strong evidence that the correlation is significant.
p-value < 0.05: there is moderate evidence that the correlation is significant.
p-value < 0.1: there is weak evidence that the correlation is significant.
p-value > 0.1: there is no evidence that the correlation is significant.
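This rule of thumb can be written as a small helper. The function name and the returned labels are illustrative and not part of the original notebook; the thresholds are the conventional 0.001, 0.05 and 0.1 cut-offs.

```python
def evidence_label(p_value: float) -> str:
    """Map a p-value to the rule-of-thumb evidence level used in this section."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"


# Example: a very small p-value counts as strong evidence.
print(evidence_label(4.4570195020504053e-20))
```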

from scipy import stats

Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.5891470005448702  with a P-value of P = 4.4570195020504053e-20

Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.589).

Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.8096565575365611  with a P-value of P =  1.0468839625927006e-47

Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.810, close to 1).

Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.

pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.6910440897821903  with a P-value of P =  9.96096322234889e-30

Since the p-value is < 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':

pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value )

The Pearson Correlation Coefficient is 0.7527948631832608  with a P-value of P = 8.256714148309272e-38

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.753).

Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':

pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.8344204348498461  with a P-value of P =  3.9699775360220333e-53

Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':

pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

The Pearson Correlation Coefficient is 0.8723367498521142  with a P-value of P = 1.8977171466563487e-63

Since the p-value is < 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price':

pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value )

The Pearson Correlation Coefficient is 0.5434325935555682  with a P-value of P =   9.207487524195266e-17

Since the p-value is < 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.543).

Let's calculate the Pearson Correlation Coefficient and P-value of 'city-mpg' and 'price':

pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is -0.6871861020862686  with a P-value of P =  2.729256568479228e-29

Since the p-value is < 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.

Let's calculate the Pearson Correlation Coefficient and P-value of 'highway-mpg' and 'price':

pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value )

The Pearson Correlation Coefficient is -0.7051147088046402  with a P-value of P =  2.197326053158553e-31

Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of about -0.705 shows that the relationship is negative and moderately strong.
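The repeated `pearsonr` calls above can be collected into one loop that builds a summary table. The sketch below runs on synthetic data generated with a fixed seed; the column names mirror the dataset, but the values, the sample size and the noise levels are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for df: values are generated, not taken from the dataset.
rng = np.random.default_rng(0)
n = 50
engine_size = rng.normal(120, 30, n)
toy = pd.DataFrame({
    "engine-size": engine_size,
    "highway-mpg": 60 - 0.2 * engine_size + rng.normal(0, 2, n),
    "price": 100 * engine_size + rng.normal(0, 1500, n),
})

# Compute the Pearson coefficient and p-value of every feature against price.
rows = []
for col in toy.columns.drop("price"):
    r, p = stats.pearsonr(toy[col], toy["price"])
    rows.append({"feature": col, "pearson_r": r, "p_value": p})

summary = pd.DataFrame(rows).set_index("feature")
print(summary)
```

On the real data, the loop would run over the numeric columns of `df` instead of `toy`.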

Conclusion:

Important continuous numerical variables:

Length
Width
Curb-weight
Engine-size
Horsepower
City-mpg
Highway-mpg
Wheel-base
Bore

Important categorical variables:

Drive-wheels

Output of df.corr():

	symboling	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	...	horsepower	peak-rpm	city-mpg	highway-mpg	price	city-L/100km	fuel-type-diesel	fuel-type-gas	aspiration-std	aspiration-turbo
symboling	1.000000	0.469772	-0.529145	-0.364511	-0.237262	-0.542261	-0.234743	-0.112069	-0.145790	0.008072	...	0.074617	0.284033	-0.030158	0.041248	-0.083327	0.062423	-0.195381	0.195381	0.050995	-0.050995
normalized-losses	0.469772	1.000000	-0.057068	0.019433	0.086961	-0.377664	0.099404	0.112362	-0.029929	0.055673	...	0.217322	0.239580	-0.225255	-0.182011	0.133999	0.238712	-0.101574	0.101574	0.006915	-0.006915
wheel-base	-0.529145	-0.057068	1.000000	0.879005	0.814593	0.583789	0.787584	0.576779	0.501576	0.144733	...	0.375610	-0.365045	-0.480029	-0.552211	0.589147	0.484047	0.306706	-0.306706	-0.254907	0.254907
length	-0.364511	0.019433	0.879005	1.000000	0.857271	0.492955	0.881058	0.685531	0.610847	0.120929	...	0.580583	-0.286688	-0.667658	-0.700186	0.691044	0.659174	0.210616	-0.210616	-0.229294	0.229294
width	-0.237262	0.086961	0.814593	0.857271	1.000000	0.300995	0.867720	0.731100	0.548485	0.182864	...	0.617115	-0.247337	-0.638155	-0.684700	0.752795	0.677111	0.243448	-0.243448	-0.304459	0.304459
height	-0.542261	-0.377664	0.583789	0.492955	0.300995	1.000000	0.310660	0.076255	0.187917	-0.081103	...	-0.085829	-0.315820	-0.057087	-0.111568	0.137284	0.008923	0.281198	-0.281198	-0.086531	0.086531
curb-weight	-0.234743	0.099404	0.787584	0.881058	0.867720	0.310660	1.000000	0.849090	0.644552	0.168669	...	0.758076	-0.279422	-0.750390	-0.795515	0.834420	0.785868	0.221082	-0.221082	-0.322097	0.322097
engine-size	-0.112069	0.112362	0.576779	0.685531	0.731100	0.076255	0.849090	1.000000	0.572878	0.208133	...	0.822689	-0.256681	-0.651002	-0.679877	0.872337	0.745337	0.070925	-0.070925	-0.110278	0.110278
bore	-0.145790	-0.029929	0.501576	0.610847	0.548485	0.187917	0.644552	0.572878	1.000000	-0.051087	...	0.566807	-0.267061	-0.581272	-0.590672	0.543433	0.553954	0.055395	-0.055395	-0.229338	0.229338
stroke	0.008072	0.055673	0.144733	0.120929	0.182864	-0.081103	0.168669	0.208133	-0.051087	1.000000	...	0.100881	-0.066021	-0.040547	-0.040170	0.083298	0.041310	0.240046	-0.240046	-0.215805	0.215805
compression-ratio	-0.181073	-0.114738	0.249689	0.159203	0.189008	0.259526	0.156444	0.029005	0.002034	0.186780	...	-0.214260	-0.436303	0.330897	0.267929	0.071176	-0.298898	0.985228	-0.985228	-0.307074	0.307074
horsepower	0.074617	0.217322	0.375610	0.580583	0.617115	-0.085829	0.758076	0.822689	0.566807	0.100881	...	1.000000	0.108163	-0.822488	-0.804702	0.809657	0.889613	-0.168755	0.168755	-0.251799	0.251799
peak-rpm	0.284033	0.239580	-0.365045	-0.286688	-0.247337	-0.315820	-0.279422	-0.256681	-0.267061	-0.066021	...	0.108163	1.000000	-0.116364	-0.059319	-0.101593	0.116528	-0.476430	0.476430	0.190772	-0.190772
city-mpg	-0.030158	-0.225255	-0.480029	-0.667658	-0.638155	-0.057087	-0.750390	-0.651002	-0.581272	-0.040547	...	-0.822488	-0.116364	1.000000	0.972024	-0.687186	-0.949692	0.264947	-0.264947	0.191068	-0.191068
highway-mpg	0.041248	-0.182011	-0.552211	-0.700186	-0.684700	-0.111568	-0.795515	-0.679877	-0.590672	-0.040170	...	-0.804702	-0.059319	0.972024	1.000000	-0.705115	-0.929940	0.197989	-0.197989	0.243429	-0.243429
price	-0.083327	0.133999	0.589147	0.691044	0.752795	0.137284	0.834420	0.872337	0.543433	0.083298	...	0.809657	-0.101593	-0.687186	-0.705115	1.000000	0.790291	0.110417	-0.110417	-0.179762	0.179762
city-L/100km	0.062423	0.238712	0.484047	0.659174	0.677111	0.008923	0.785868	0.745337	0.553954	0.041310	...	0.889613	0.116528	-0.949692	-0.929940	0.790291	1.000000	-0.240676	0.240676	-0.158912	0.158912
fuel-type-diesel	-0.195381	-0.101574	0.306706	0.210616	0.243448	0.281198	0.221082	0.070925	0.055395	0.240046	...	-0.168755	-0.476430	0.264947	0.197989	0.110417	-0.240676	1.000000	-1.000000	-0.407787	0.407787
fuel-type-gas	0.195381	0.101574	-0.306706	-0.210616	-0.243448	-0.281198	-0.221082	-0.070925	-0.055395	-0.240046	...	0.168755	0.476430	-0.264947	-0.197989	-0.110417	0.240676	-1.000000	1.000000	0.407787	-0.407787
aspiration-std	0.050995	0.006915	-0.254907	-0.229294	-0.304459	-0.086531	-0.322097	-0.110278	-0.229338	-0.215805	...	-0.251799	0.190772	0.191068	0.243429	-0.179762	-0.158912	-0.407787	0.407787	1.000000	-1.000000
aspiration-turbo	-0.050995	-0.006915	0.254907	0.229294	0.304459	0.086531	0.322097	0.110278	0.229338	0.215805	...	0.251799	-0.190772	-0.191068	-0.243429	0.179762	0.158912	0.407787	-0.407787	-1.000000	1.000000

Output of df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr():

	bore	stroke	compression-ratio	horsepower
bore	1.000000	-0.051087	0.002034	0.566807
stroke	-0.051087	1.000000	0.186780	0.100881
compression-ratio	0.002034	0.186780	1.000000	-0.214260
horsepower	0.566807	0.100881	-0.214260	1.000000

Output of df.describe():

	symboling	normalized-losses	wheel-base	length	width	height	curb-weight	engine-size	bore	stroke	...	horsepower	peak-rpm	city-mpg	highway-mpg	price	city-L/100km	fuel-type-diesel	fuel-type-gas	aspiration-std	aspiration-turbo
count	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	...	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000	200.000000
mean	0.830000	122.000000	98.848000	0.837232	0.915250	0.899523	2555.705000	126.860000	3.329981	3.259816	...	103.365000	5118.259901	25.200000	30.705000	13205.690000	9.937914	0.100000	0.900000	0.820000	0.180000
std	1.248557	32.076542	6.038261	0.059333	0.029207	0.040610	518.594552	41.650501	0.268562	0.314177	...	37.455525	479.240743	6.432487	6.827227	7966.982558	2.539415	0.300753	0.300753	0.385152	0.385152
min	-2.000000	65.000000	86.600000	0.678039	0.837500	0.799331	1488.000000	61.000000	2.540000	2.070000	...	48.000000	4150.000000	13.000000	16.000000	5118.000000	4.795918	0.000000	0.000000	0.000000	0.000000
25%	0.000000	100.250000	94.500000	0.800937	0.891319	0.869565	2163.000000	97.750000	3.150000	3.117500	...	70.000000	4800.000000	19.000000	25.000000	7775.000000	7.833333	0.000000	1.000000	1.000000	0.000000
50%	1.000000	122.000000	97.000000	0.832292	0.909722	0.904682	2414.000000	119.500000	3.310000	3.290000	...	95.000000	5162.995050	24.000000	30.000000	10270.000000	9.791667	0.000000	1.000000	1.000000	0.000000
75%	2.000000	138.250000	102.400000	0.881788	0.926042	0.928512	2928.250000	142.000000	3.582500	3.410000	...	116.000000	5500.000000	30.000000	34.000000	16500.750000	12.368421	0.000000	1.000000	1.000000	0.000000
max	3.000000	256.000000	120.900000	1.000000	1.000000	1.000000	4066.000000	326.000000	3.940000	4.170000	...	262.000000	6600.000000	49.000000	54.000000	45400.000000	18.076923	1.000000	1.000000	1.000000	1.000000

Output of df.describe(include=['object']):

	make	num-of-doors	body-style	drive-wheels	engine-location	engine-type	num-of-cylinders	fuel-system
count	200	200	200	200	200	200	200	200
unique	22	2	5	3	2	6	7	8
top	toyota	four	sedan	fwd	front	ohc	four	mpfi
freq	32	115	94	118	197	145	156	91