Car Evaluation Sample Project - Part 2

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Second Step: Data Wrangling¶

Identifying and handling missing values¶

1. Identify missing data¶

In our dataset, the missing data comes with the question mark "?", so let's replace ? with NaN:

df.replace("?", np.nan, inplace = True)
df.head(5)

missing_data = df.isnull()
missing_data.head(10)

missing_data.tail(10)

Let's count the missing values in each column:¶

for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")

symboling
False    204
Name: symboling, dtype: int64

normalized-losses
False    164
True      40
Name: normalized-losses, dtype: int64

make
False    204
Name: make, dtype: int64

fuel-type
False    204
Name: fuel-type, dtype: int64

aspiration
False    204
Name: aspiration, dtype: int64

num-of-doors
False    202
True       2
Name: num-of-doors, dtype: int64

body-style
False    204
Name: body-style, dtype: int64

drive-wheels
False    204
Name: drive-wheels, dtype: int64

engine-location
False    204
Name: engine-location, dtype: int64

wheel-base
False    204
Name: wheel-base, dtype: int64

length
False    204
Name: length, dtype: int64

width
False    204
Name: width, dtype: int64

height
False    204
Name: height, dtype: int64

curb-weight
False    204
Name: curb-weight, dtype: int64

engine-type
False    204
Name: engine-type, dtype: int64

num-of-cylinders
False    204
Name: num-of-cylinders, dtype: int64

engine-size
False    204
Name: engine-size, dtype: int64

fuel-system
False    204
Name: fuel-system, dtype: int64

bore
False    200
True       4
Name: bore, dtype: int64

stroke
False    200
True       4
Name: stroke, dtype: int64

compression-ratio
False    204
Name: compression-ratio, dtype: int64

horsepower
False    202
True       2
Name: horsepower, dtype: int64

peak-rpm
False    202
True       2
Name: peak-rpm, dtype: int64

city-mpg
False    204
Name: city-mpg, dtype: int64

highway-mpg
False    204
Name: highway-mpg, dtype: int64

price
False    200
True       4
Name: price, dtype: int64

Based on the summary above, you can find the below:

"normalized-losses": 40 missing data
"num-of-doors": 2 missing data
"bore": 4 missing data
"stroke" : 4 missing data
"horsepower": 2 missing data
"peak-rpm": 2 missing data
"price": 4 missing data

2. Dealing with missing data¶

To deal with the missing data, we can choose one way of the below:

Drop data
a. Drop the whole row
b. Drop the whole column
Replace data
a. Replace it by mean
b. Replace it by frequency
c. Replace it based on other functions

Here is what we are going to do:¶

Replace by mean:

"normalized-losses": 40 missing data
"stroke": 4 missing data
"bore": 4 missing data
"horsepower": 2 missing data
"peak-rpm": 2 missing data

Replace by frequency:

"num-of-doors": 2 missing data, replace them with "4".

Drop the whole row:

"price": 4 missing data "cannot be used in prediction

[A] Calculate he mean value for the "normalized-losses" column

avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)

Average of normalized-losses: 122.0

Replace "NaN" with mean value in "normalized-losses" column

df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)

[B] Calculate the mean value for the "bore" column

avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)

Average of bore: 3.3290500000000014

Replace "NaN" with the mean value in the "bore" column

df["bore"].replace(np.nan, avg_bore, inplace=True)

[C] Calculate the mean value for the "stroke" column

avg_stroke = df["stroke"].astype("float").mean(axis = 0)
print("Average of stroke:", avg_stroke)

Average of stroke: 3.258299999999994

Replace "NaN" with the mean value in the "stroke" column

df["stroke"].replace(np.nan, avg_stroke, inplace = True)

[D] Calculate the mean value for the "horsepower" column

avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)

Average horsepower: 104.22277227722772

Replace "NaN" with the mean value in the "horsepower" column

df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)

[E] Calculate the mean value for "peak-rpm" column

avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)

Average peak rpm: 5125.990099009901

Replace "NaN" with the mean value in the "peak-rpm" column

df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)

Now, let's check the number of doors column, since we are going to replace the number of doors by frequency, let's check the count of the variables inside the column:¶

df['num-of-doors'].value_counts()

four    114
two      88
Name: num-of-doors, dtype: int64

Replace "NaN" with the most frequent value in the "num-of-doors" column

df["num-of-doors"].replace(np.nan, "four", inplace=True)

Now, let's drop the rows that has no price:¶

df.dropna(subset=["price"], axis=0, inplace=True)
# reset index, because we droped two rows
df.reset_index(drop=True, inplace=True)

df.head()

3. Correct data format¶

df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

Let's correct some columns format:¶

df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

Let's recheck again:¶

df.dtypes

symboling              int64
normalized-losses      int32
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower            object
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Data Standardization:¶

Transform mpg to L/100km:

In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are represented by mpg (miles per gallon) unit

We will need to apply data transformation to transform mpg into L/100km.

Fromula is L/100km = 235 / mpg

df['city-L/100km'] = 235/df["city-mpg"]
# check your transformed data 
df.head()

Data Standardization:¶

Rescaling original value by (original value)/(maximum value)¶

# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
# show the scaled columns
df[["length","width","height"]].head()

Data Binning:¶

transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis¶

df["horsepower"]=df["horsepower"].astype(int, copy=True)

%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
plt.pyplot.hist(df["horsepower"])

# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")

Text(0.5, 1.0, 'horsepower bins')

#We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. 
#The values will determine when one bin ends and another begins.
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins

array([ 48.        , 119.33333333, 190.66666667, 262.        ])

#set group names
group_names = ['Low', 'Medium', 'High']

#We apply the function "cut" to determine what each value of `df['horsepower']` belongs to.
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)

#check number of cars in each bin
df["horsepower-binned"].value_counts()

Low       152
Medium     43
High        5
Name: horsepower-binned, dtype: int64

#plotting the distribution of each bin
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
pyplot.bar(group_names, df["horsepower-binned"].value_counts())

# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")

Text(0.5, 1.0, 'horsepower bins')

Indicator Variables¶

We use indicator variables so we can use categorical variables for regression analysis in the later modules¶

Let's check our columns again:¶

df.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price', 'city-L/100km', 'horsepower-binned'],
      dtype='object')

We'll create another 2 columns representing the categories of fuel (gas or deisel)¶

dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()

We'll change the column names for clarity¶

dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()

Now, we have to merge the new columns in our dataframe, and drop the "fuel-type" column¶

# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)

We will do the same to the aspiration column:¶

# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['aspiration'])
# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)
# show first 5 instances of data frame "dummy_variable_1"
dummy_variable_2.head()

# merge the new dataframe to the original datafram
df = pd.concat([df, dummy_variable_2], axis=1)
# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)

Let's check finally our clean dataframe:¶

df.head()

We can now save this clean dataframe to a new csv:¶

df.to_csv('clean_df.csv')

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	NaN	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
1	1	NaN	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
2	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
3	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450
4	2	NaN	audi	gas	std	two	sedan	fwd	front	99.8	...	136	mpfi	3.19	3.40	8.5	110	5500	19	25	15250

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	3	122	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	130	mpfi	3.47	2.68	9.0	111	5000	21	27	16500
1	1	122	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	152	mpfi	2.68	3.47	9.0	154	5000	19	26	16500
2	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	109	mpfi	3.19	3.40	10.0	102	5500	24	30	13950
3	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	136	mpfi	3.19	3.40	8.0	115	5500	18	22	17450
4	2	122	audi	gas	std	two	sedan	fwd	front	99.8	...	136	mpfi	3.19	3.40	8.5	110	5500	19	25	15250

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price	city-L/100km
0	3	122	alfa-romero	gas	std	two	convertible	rwd	front	88.6	...	mpfi	3.47	2.68	9.0	111	5000.0	21	27	16500.0	11.190476
1	1	122	alfa-romero	gas	std	two	hatchback	rwd	front	94.5	...	mpfi	2.68	3.47	9.0	154	5000.0	19	26	16500.0	12.368421
2	2	164	audi	gas	std	four	sedan	fwd	front	99.8	...	mpfi	3.19	3.40	10.0	102	5500.0	24	30	13950.0	9.791667
3	2	164	audi	gas	std	four	sedan	4wd	front	99.4	...	mpfi	3.19	3.40	8.0	115	5500.0	18	22	17450.0	13.055556
4	2	122	audi	gas	std	two	sedan	fwd	front	99.8	...	mpfi	3.19	3.40	8.5	110	5500.0	19	25	15250.0	12.368421

	length	width	height
0	0.811148	0.890278	0.816054
1	0.822681	0.909722	0.876254
2	0.848630	0.919444	0.908027
3	0.848630	0.922222	0.908027
4	0.851994	0.920833	0.887960

	symboling	normalized-losses	make	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	length	width	...	peak-rpm	city-mpg	highway-mpg	price	city-L/100km	horsepower-binned	fuel-type-gas	aspiration-std
0	3	122	alfa-romero	two	convertible	rwd	front	88.6	0.811148	0.890278	...	5000.0	21	27	16500.0	11.190476	Low	1	1
1	1	122	alfa-romero	two	hatchback	rwd	front	94.5	0.822681	0.909722	...	5000.0	19	26	16500.0	12.368421	Medium	1	1
2	2	164	audi	four	sedan	fwd	front	99.8	0.848630	0.919444	...	5500.0	24	30	13950.0	9.791667	Low	1	1
3	2	164	audi	four	sedan	4wd	front	99.4	0.848630	0.922222	...	5500.0	18	22	17450.0	13.055556	Low	1	1
4	2	122	audi	two	sedan	fwd	front	99.8	0.851994	0.920833	...	5500.0	19	25	15250.0	12.368421	Low	1	1

	symboling	normalized-losses	make	fuel-type	aspiration	num-of-doors	body-style	drive-wheels	engine-location	wheel-base	...	engine-size	fuel-system	bore	stroke	compression-ratio	horsepower	peak-rpm	city-mpg	highway-mpg	price
0	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
1	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
4	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
5	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
6	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
7	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False
8	False	True	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	True
9	False	False	False	False	False	False	False	False	False	False	...	False	False	False	False	False	False	False	False	False	False

	diesel	gas
0	0	1
1	0	1
2	0	1
3	0	1
4	0	1