Car Evaluation Sample Project - Part 2

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

Second Step: Data Wrangling

Identifying and handling missing values

1. Identify missing data

In our dataset, the missing data comes with the question mark "?", so let's replace ? with NaN:

In [12]:
df.replace("?", np.nan, inplace = True)
df.head(5)
Out[12]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
3 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
4 2 NaN audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250

5 rows × 26 columns

In [13]:
missing_data = df.isnull()
missing_data.head(10)
Out[13]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False True False False False False False False False False ... False False False False False False False False False False
5 False False False False False False False False False False ... False False False False False False False False False False
6 False True False False False False False False False False ... False False False False False False False False False False
7 False False False False False False False False False False ... False False False False False False False False False False
8 False True False False False False False False False False ... False False False False False False False False False True
9 False False False False False False False False False False ... False False False False False False False False False False

10 rows × 26 columns

In [14]:
missing_data.tail(10)
Out[14]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
194 False False False False False False False False False False ... False False False False False False False False False False
195 False False False False False False False False False False ... False False False False False False False False False False
196 False False False False False False False False False False ... False False False False False False False False False False
197 False False False False False False False False False False ... False False False False False False False False False False
198 False False False False False False False False False False ... False False False False False False False False False False
199 False False False False False False False False False False ... False False False False False False False False False False
200 False False False False False False False False False False ... False False False False False False False False False False
201 False False False False False False False False False False ... False False False False False False False False False False
202 False False False False False False False False False False ... False False False False False False False False False False
203 False False False False False False False False False False ... False False False False False False False False False False

10 rows × 26 columns

Let's count the missing values in each column:

In [15]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")
symboling
False    204
Name: symboling, dtype: int64

normalized-losses
False    164
True      40
Name: normalized-losses, dtype: int64

make
False    204
Name: make, dtype: int64

fuel-type
False    204
Name: fuel-type, dtype: int64

aspiration
False    204
Name: aspiration, dtype: int64

num-of-doors
False    202
True       2
Name: num-of-doors, dtype: int64

body-style
False    204
Name: body-style, dtype: int64

drive-wheels
False    204
Name: drive-wheels, dtype: int64

engine-location
False    204
Name: engine-location, dtype: int64

wheel-base
False    204
Name: wheel-base, dtype: int64

length
False    204
Name: length, dtype: int64

width
False    204
Name: width, dtype: int64

height
False    204
Name: height, dtype: int64

curb-weight
False    204
Name: curb-weight, dtype: int64

engine-type
False    204
Name: engine-type, dtype: int64

num-of-cylinders
False    204
Name: num-of-cylinders, dtype: int64

engine-size
False    204
Name: engine-size, dtype: int64

fuel-system
False    204
Name: fuel-system, dtype: int64

bore
False    200
True       4
Name: bore, dtype: int64

stroke
False    200
True       4
Name: stroke, dtype: int64

compression-ratio
False    204
Name: compression-ratio, dtype: int64

horsepower
False    202
True       2
Name: horsepower, dtype: int64

peak-rpm
False    202
True       2
Name: peak-rpm, dtype: int64

city-mpg
False    204
Name: city-mpg, dtype: int64

highway-mpg
False    204
Name: highway-mpg, dtype: int64

price
False    200
True       4
Name: price, dtype: int64

Based on the summary above, you can find the below:

  1. "normalized-losses": 40 missing data
  2. "num-of-doors": 2 missing data
  3. "bore": 4 missing data
  4. "stroke" : 4 missing data
  5. "horsepower": 2 missing data
  6. "peak-rpm": 2 missing data
  7. "price": 4 missing data

2. Dealing with missing data

To deal with the missing data, we can choose one way of the below:

  1. Drop data
    a. Drop the whole row
    b. Drop the whole column
  2. Replace data
    a. Replace it by mean
    b. Replace it by frequency
    c. Replace it based on other functions

Here is what we are going to do:

Replace by mean:

  • "normalized-losses": 40 missing data
  • "stroke": 4 missing data
  • "bore": 4 missing data
  • "horsepower": 2 missing data
  • "peak-rpm": 2 missing data

Replace by frequency:

  • "num-of-doors": 2 missing data, replace them with "4".

Drop the whole row:

  • "price": 4 missing data "cannot be used in prediction

[A] Calculate he mean value for the "normalized-losses" column

In [16]:
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
Average of normalized-losses: 122.0

Replace "NaN" with mean value in "normalized-losses" column

In [17]:
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)

[B] Calculate the mean value for the "bore" column

In [18]:
avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
Average of bore: 3.3290500000000014

Replace "NaN" with the mean value in the "bore" column

In [19]:
df["bore"].replace(np.nan, avg_bore, inplace=True)

[C] Calculate the mean value for the "stroke" column

In [20]:
avg_stroke = df["stroke"].astype("float").mean(axis = 0)
print("Average of stroke:", avg_stroke)
Average of stroke: 3.258299999999994

Replace "NaN" with the mean value in the "stroke" column

In [21]:
df["stroke"].replace(np.nan, avg_stroke, inplace = True)

[D] Calculate the mean value for the "horsepower" column

In [22]:
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
Average horsepower: 104.22277227722772

Replace "NaN" with the mean value in the "horsepower" column

In [23]:
df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)

[E] Calculate the mean value for "peak-rpm" column

In [24]:
avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
Average peak rpm: 5125.990099009901

Replace "NaN" with the mean value in the "peak-rpm" column

In [25]:
df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)

Now, let's check the number of doors column, since we are going to replace the number of doors by frequency, let's check the count of the variables inside the column:

In [26]:
df['num-of-doors'].value_counts()
Out[26]:
four    114
two      88
Name: num-of-doors, dtype: int64

Replace "NaN" with the most frequent value in the "num-of-doors" column

In [27]:
df["num-of-doors"].replace(np.nan, "four", inplace=True)

Now, let's drop the rows that has no price:

In [28]:
df.dropna(subset=["price"], axis=0, inplace=True)
# reset index, because we droped two rows
df.reset_index(drop=True, inplace=True)
In [29]:
df.head()
Out[29]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
3 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
4 2 122 audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250

5 rows × 26 columns

3. Correct data format

In [30]:
df.dtypes
Out[30]:
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

Let's correct some columns format:

In [31]:
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

Let's recheck again:

In [32]:
df.dtypes
Out[32]:
symboling              int64
normalized-losses      int32
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower            object
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Data Standardization:

Transform mpg to L/100km:

In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are represented by mpg (miles per gallon) unit

We will need to apply data transformation to transform mpg into L/100km.

Fromula is L/100km = 235 / mpg

In [33]:
df['city-L/100km'] = 235/df["city-mpg"]
# check your transformed data 
df.head()
Out[33]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 16500.0 11.190476
1 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154 5000.0 19 26 16500.0 12.368421
2 2 164 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102 5500.0 24 30 13950.0 9.791667
3 2 164 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115 5500.0 18 22 17450.0 13.055556
4 2 122 audi gas std two sedan fwd front 99.8 ... mpfi 3.19 3.40 8.5 110 5500.0 19 25 15250.0 12.368421

5 rows × 27 columns

Data Standardization:

Rescaling original value by (original value)/(maximum value)

In [34]:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()
# show the scaled columns
df[["length","width","height"]].head()
Out[34]:
length width height
0 0.811148 0.890278 0.816054
1 0.822681 0.909722 0.876254
2 0.848630 0.919444 0.908027
3 0.848630 0.922222 0.908027
4 0.851994 0.920833 0.887960

Data Binning:

transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis

In [35]:
df["horsepower"]=df["horsepower"].astype(int, copy=True)
In [36]:
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
plt.pyplot.hist(df["horsepower"])

# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")
Out[36]:
Text(0.5, 1.0, 'horsepower bins')
In [37]:
#We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. 
#The values will determine when one bin ends and another begins.
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins
Out[37]:
array([ 48.        , 119.33333333, 190.66666667, 262.        ])
In [38]:
#set group names
group_names = ['Low', 'Medium', 'High']
In [39]:
#We apply the function "cut" to determine what each value of `df['horsepower']` belongs to.
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)
Out[39]:
horsepower horsepower-binned
0 111 Low
1 154 Medium
2 102 Low
3 115 Low
4 110 Low
5 110 Low
6 110 Low
7 140 Medium
8 101 Low
9 101 Low
10 121 Medium
11 121 Medium
12 121 Medium
13 182 Medium
14 182 Medium
15 182 Medium
16 48 Low
17 70 Low
18 70 Low
19 68 Low
In [40]:
#check number of cars in each bin
df["horsepower-binned"].value_counts()
Out[40]:
Low       152
Medium     43
High        5
Name: horsepower-binned, dtype: int64
In [41]:
#plotting the distribution of each bin
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
pyplot.bar(group_names, df["horsepower-binned"].value_counts())

# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")
Out[41]:
Text(0.5, 1.0, 'horsepower bins')

Indicator Variables

We use indicator variables so we can use categorical variables for regression analysis in the later modules

Let's check our columns again:
In [42]:
df.columns
Out[42]:
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price', 'city-L/100km', 'horsepower-binned'],
      dtype='object')
We'll create another 2 columns representing the categories of fuel (gas or deisel)
In [43]:
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()
Out[43]:
diesel gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
We'll change the column names for clarity
In [44]:
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()
Out[44]:
fuel-type-diesel fuel-type-gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
Now, we have to merge the new columns in our dataframe, and drop the "fuel-type" column
In [45]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)
In [46]:
# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)

We will do the same to the aspiration column:

In [47]:
# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['aspiration'])
# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)
# show first 5 instances of data frame "dummy_variable_1"
dummy_variable_2.head()
Out[47]:
aspiration-std aspiration-turbo
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
In [48]:
# merge the new dataframe to the original datafram
df = pd.concat([df, dummy_variable_2], axis=1)
# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)

Let's check finally our clean dataframe:

In [49]:
df.head()
Out[49]:
symboling normalized-losses make num-of-doors body-style drive-wheels engine-location wheel-base length width ... peak-rpm city-mpg highway-mpg price city-L/100km horsepower-binned fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
0 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 5000.0 21 27 16500.0 11.190476 Low 0 1 1 0
1 1 122 alfa-romero two hatchback rwd front 94.5 0.822681 0.909722 ... 5000.0 19 26 16500.0 12.368421 Medium 0 1 1 0
2 2 164 audi four sedan fwd front 99.8 0.848630 0.919444 ... 5500.0 24 30 13950.0 9.791667 Low 0 1 1 0
3 2 164 audi four sedan 4wd front 99.4 0.848630 0.922222 ... 5500.0 18 22 17450.0 13.055556 Low 0 1 1 0
4 2 122 audi two sedan fwd front 99.8 0.851994 0.920833 ... 5500.0 19 25 15250.0 12.368421 Low 0 1 1 0

5 rows × 30 columns

We can now save this clean dataframe to a new csv:

In [50]:
df.to_csv('clean_df.csv')