Car Evaluation Sample Project - Part 1

Car Evaluation Dataset: https://archive.ics.uci.edu/ml/datasets/Car+Evaluation

First Step: Data Acquisition

Loading the dataset into a pandas DataFrame

In [51]:
import pandas as pd
import numpy as np
In [3]:
# Use the raw file URL; the GitHub "blob" page URL serves an HTML page, not the CSV
path = "https://raw.githubusercontent.com/mmouty/energynds/main/energynds/assets/datasets/auto.csv"
In [4]:
df = pd.read_csv(path)
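
Links copied from the GitHub web interface usually have the form github.com/<owner>/<repo>/blob/<branch>/<path> and serve an HTML page, which read_csv cannot parse as the dataset. A minimal sketch of turning such a link into its raw-content equivalent follows; to_raw_github_url is a hypothetical helper written for this note, not part of pandas or of the original notebook:

```python
def to_raw_github_url(url: str) -> str:
    """Convert a GitHub 'blob' page URL to the matching raw-content URL.

    Hypothetical helper for illustration; URLs that do not look like
    GitHub blob links are returned unchanged.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        return url
    # "owner/repo" sits before "/blob/", "branch/path/to/file" after it
    owner_repo, rest = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{rest}"


blob_url = "https://github.com/mmouty/energynds/blob/main/energynds/assets/datasets/auto.csv"
raw_url = to_raw_github_url(blob_url)
print(raw_url)
```

The resulting string can then be passed to pd.read_csv in the same way as path above.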

Checking our dataframe

In [5]:
print("The first 5 rows of the dataframe")
df.head(5)
The first 5 rows of the dataframe
Out[5]:
3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
3 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
4 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250

5 rows × 26 columns

In [6]:
print("The last 5 rows of the dataframe")
df.tail(5)
The last 5 rows of the dataframe
Out[6]:
3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
199 -1 95 volvo gas std four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 23 28 16845
200 -1 95 volvo gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 8.7 160 5300 19 25 19045
201 -1 95 volvo gas std four sedan rwd front 109.1 ... 173 mpfi 3.58 2.87 8.8 134 5500 18 23 21485
202 -1 95 volvo diesel turbo four sedan rwd front 109.1 ... 145 idi 3.01 3.40 23.0 106 4800 26 27 22470
203 -1 95 volvo gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 19 25 22625

5 rows × 26 columns

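The file has no header row, so pandas has used the first data record as the column labels (the unnumbered row at the top of the outputs above). We therefore define proper headers for the dataset: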

In [7]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)
headers
 ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

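Now we replace the column labels with these headers and recheck the dataframe. This overwrites the labels that pandas took from the first line of the file, so that record is not part of the data: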

In [8]:
df.columns = headers
df.head(3)
Out[8]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950

3 rows × 26 columns
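
Assigning df.columns after a default read_csv call relabels the columns, but the record that pandas already consumed as the header is not brought back. A minimal sketch of reading a headerless file in one step with header=None and names= is shown here; the three-column sample text and the toy_headers list are a cut-down illustration, not the real file:

```python
import io

import pandas as pd

# Toy excerpt (first three fields only), standing in for the headerless CSV
sample = "3,?,alfa-romero\n1,?,alfa-romero\n2,164,audi\n"
toy_headers = ["symboling", "normalized-losses", "make"]

# Default behaviour: the first line is taken as the header, so one record is lost
default_read = pd.read_csv(io.StringIO(sample))

# header=None says the file has no header line; names= supplies the labels
full_read = pd.read_csv(io.StringIO(sample), header=None, names=toy_headers)

print(len(default_read), "rows with the default call")
print(len(full_read), "rows with header=None")
print(full_read)
```

Applied to the full dataset, the same call would take the 26-element headers list defined earlier as the names argument.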

We can now save the dataframe to a new CSV file:

In [9]:
df.to_csv("auto_with_headers.csv", index=False)
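
The index=False argument keeps the row index out of the saved file. A small sketch of the difference, using an in-memory buffer and a made-up two-row dataframe instead of the real data:

```python
import io

import pandas as pd

# Made-up toy dataframe, for illustration only
toy = pd.DataFrame({"make": ["audi", "volvo"], "price": [13950, 16845]})

# Saving with the default index=True writes the row index as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# Saving with index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'make', 'price']
print(cols_without_index)  # ['make', 'price']
```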

Let's get an overview of the dataset:

In [10]:
df.describe(include = "all")
Out[10]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 204.000000 204 204 204 204 204 204 204 204 204.000000 ... 204.000000 204 204 204 204.000000 204 204 204.000000 204.000000 204
unique NaN 52 22 2 2 3 5 3 2 NaN ... NaN 8 39 37 NaN 60 24 NaN NaN 186
top NaN ? toyota gas std four sedan fwd front NaN ... NaN mpfi 3.62 3.40 NaN 68 5500 NaN NaN ?
freq NaN 40 32 184 167 114 96 120 201 NaN ... NaN 93 23 20 NaN 19 37 NaN NaN 4
mean 0.823529 NaN NaN NaN NaN NaN NaN NaN NaN 98.806373 ... 126.892157 NaN NaN NaN 10.148137 NaN NaN 25.240196 30.769608 NaN
std 1.239035 NaN NaN NaN NaN NaN NaN NaN NaN 5.994144 ... 41.744569 NaN NaN NaN 3.981000 NaN NaN 6.551513 6.898337 NaN
min -2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 86.600000 ... 61.000000 NaN NaN NaN 7.000000 NaN NaN 13.000000 16.000000 NaN
25% 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 94.500000 ... 97.000000 NaN NaN NaN 8.575000 NaN NaN 19.000000 25.000000 NaN
50% 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN 97.000000 ... 119.500000 NaN NaN NaN 9.000000 NaN NaN 24.000000 30.000000 NaN
75% 2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 102.400000 ... 142.000000 NaN NaN NaN 9.400000 NaN NaN 30.000000 34.500000 NaN
max 3.000000 NaN NaN NaN NaN NaN NaN NaN NaN 120.900000 ... 326.000000 NaN NaN NaN 23.000000 NaN NaN 49.000000 54.000000 NaN

11 rows × 26 columns
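
In the summary above, the top row reports ? as the most frequent value in normalized-losses (40 occurrences) and in price (4 occurrences), which suggests that ? is used as a missing-value placeholder. A minimal sketch of counting that placeholder per column, shown on a made-up toy dataframe rather than the real data:

```python
import pandas as pd

# Made-up toy rows that mimic the "?" placeholder seen in the summary
toy = pd.DataFrame({
    "symboling": [3, 1, 2],
    "normalized-losses": ["?", "?", "164"],
    "price": ["13495", "?", "13950"],
})

# Element-wise comparison with "?", then count the matches in each column
placeholder_counts = toy.eq("?").sum()

# Show only the columns that contain the placeholder
print(placeholder_counts[placeholder_counts > 0])
```

Running the same two lines on df would list every column that contains the placeholder.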

Checking column data types and non-null counts

In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   symboling          204 non-null    int64
 1   normalized-losses  204 non-null    object
 2   make               204 non-null    object
 3   fuel-type          204 non-null    object
 4   aspiration         204 non-null    object
 5   num-of-doors       204 non-null    object
 6   body-style         204 non-null    object
 7   drive-wheels       204 non-null    object
 8   engine-location    204 non-null    object
 9   wheel-base         204 non-null    float64
 10  length             204 non-null    float64
 11  width              204 non-null    float64
 12  height             204 non-null    float64
 13  curb-weight        204 non-null    int64
 14  engine-type        204 non-null    object
 15  num-of-cylinders   204 non-null    object
 16  engine-size        204 non-null    int64
 17  fuel-system        204 non-null    object
 18  bore               204 non-null    object
 19  stroke             204 non-null    object
 20  compression-ratio  204 non-null    float64
 21  horsepower         204 non-null    object
 22  peak-rpm           204 non-null    object
 23  city-mpg           204 non-null    int64
 24  highway-mpg        204 non-null    int64
 25  price              204 non-null    object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.6+ KB
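
Several columns that hold numbers, such as normalized-losses, horsepower and price, are reported as object, and every column shows 204 non-null values. The most likely cause is the ? placeholder, which pandas reads as ordinary text. One possible way to handle this at load time is the na_values argument of read_csv; the sketch below uses made-up toy rows with a header line, not the real file:

```python
import io

import pandas as pd

# Made-up toy rows; "?" stands for a missing value
sample = (
    "normalized-losses,horsepower,price\n"
    "?,111,13495\n"
    "164,102,13950\n"
    "?,110,?\n"
)

# Treat "?" as missing so the affected columns are parsed as numbers
cleaned = pd.read_csv(io.StringIO(sample), na_values="?")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Columns that contain missing values come back as float64, because NaN is a floating-point value, and info() would then report the true non-null counts.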