Housing value estimation model training
Let's train a simple regressor using Scikit-Learn, then convert the pipeline to the ONNX format.
In [1]:
from pathlib import Path
import numpy as np
import onnxruntime as ort
import pandas as pd
import skl2onnx
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
Load the French housing transactions dataset (DVF) for the Isère department in 2022:
In [2]:
dvf_38 = pd.read_csv(
"https://files.data.gouv.fr/geo-dvf/latest/csv/2022/departements/38.csv.gz"
)
dvf_38.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75871 entries, 0 to 75870
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   id_mutation                   75871 non-null  object
 1   date_mutation                 75871 non-null  object
 2   numero_disposition            75871 non-null  int64
 3   nature_mutation               75871 non-null  object
 4   valeur_fonciere               75504 non-null  float64
 5   adresse_numero                48243 non-null  float64
 6   adresse_suffixe               2482 non-null   object
 7   adresse_nom_voie              74526 non-null  object
 8   adresse_code_voie             74529 non-null  object
 9   code_postal                   74528 non-null  float64
 10  code_commune                  75871 non-null  int64
 11  nom_commune                   75871 non-null  object
 12  code_departement              75871 non-null  int64
 13  ancien_code_commune           0 non-null      float64
 14  ancien_nom_commune            0 non-null      float64
 15  id_parcelle                   75871 non-null  object
 16  ancien_id_parcelle            0 non-null      float64
 17  numero_volume                 150 non-null    float64
 18  lot1_numero                   33401 non-null  object
 19  lot1_surface_carrez           8391 non-null   float64
 20  lot2_numero                   9313 non-null   float64
 21  lot2_surface_carrez           3167 non-null   float64
 22  lot3_numero                   1215 non-null   float64
 23  lot3_surface_carrez           208 non-null    float64
 24  lot4_numero                   353 non-null    float64
 25  lot4_surface_carrez           48 non-null     float64
 26  lot5_numero                   180 non-null    float64
 27  lot5_surface_carrez           29 non-null     float64
 28  nombre_lots                   75871 non-null  int64
 29  code_type_local               46068 non-null  float64
 30  type_local                    46068 non-null  object
 31  surface_reelle_bati           25589 non-null  float64
 32  nombre_pieces_principales     46034 non-null  float64
 33  code_nature_culture           42130 non-null  object
 34  nature_culture                42130 non-null  object
 35  code_nature_culture_speciale  2945 non-null   object
 36  nature_culture_speciale       2945 non-null   object
 37  surface_terrain               42130 non-null  float64
 38  longitude                     73674 non-null  float64
 39  latitude                      73674 non-null  float64
dtypes: float64(22), int64(4), object(14)
memory usage: 23.2+ MB
/tmp/ipykernel_1984/1738013210.py:1: DtypeWarning: Columns (18) have mixed types. Specify dtype option on import or set low_memory=False.
  dvf_38 = pd.read_csv(
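The DtypeWarning comes from column 18 (lot1_numero), which mixes numeric and string values. A minimal way to avoid it, assuming the extra memory use is acceptable, is to let pandas read the whole file in one pass so that one consistent dtype is inferred per column:

# Sketch: read the CSV in a single pass to get consistent dtype inference
dvf_38 = pd.read_csv(
    "https://files.data.gouv.fr/geo-dvf/latest/csv/2022/departements/38.csv.gz",
    low_memory=False,
)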
Prepare the dataset to keep only sales of apartments in Grenoble:
In [3]:
dataset = dvf_38.copy()
dataset = dataset[
(dataset.nature_mutation == "Vente")
& (dataset.type_local == "Appartement")
& (dataset.nom_commune == "Grenoble")
]
dataset = dataset[
[
"surface_reelle_bati",
"nombre_pieces_principales",
"latitude",
"longitude",
"valeur_fonciere",
]
]
dataset = dataset.rename(
columns={
"surface_reelle_bati": "area",
"nombre_pieces_principales": "rooms",
"valeur_fonciere": "value",
}
)
dataset = dataset.dropna()
dataset = dataset.reset_index()
dataset
Out[3]:
|  | index | area | rooms | latitude | longitude | value |
|---|---|---|---|---|---|---|
| 0 | 1 | 70.0 | 3.0 | 45.176163 | 5.719166 | 225000.0 |
| 1 | 6 | 109.0 | 4.0 | 45.187065 | 5.718309 | 257900.0 |
| 2 | 15 | 54.0 | 2.0 | 45.181912 | 5.711105 | 151500.0 |
| 3 | 26 | 97.0 | 5.0 | 45.173124 | 5.708733 | 160000.0 |
| 4 | 31 | 31.0 | 1.0 | 45.182767 | 5.743471 | 87000.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3523 | 44672 | 54.0 | 3.0 | 45.179669 | 5.717220 | 165500.0 |
| 3524 | 44679 | 74.0 | 5.0 | 45.180877 | 5.711429 | 127000.0 |
| 3525 | 44688 | 61.0 | 3.0 | 45.166853 | 5.726352 | 110000.0 |
| 3526 | 44691 | 73.0 | 3.0 | 45.181464 | 5.720759 | 192000.0 |
| 3527 | 44692 | 57.0 | 4.0 | 45.169246 | 5.723737 | 112420.0 |
3528 rows × 6 columns
Split the dataset into train and test sets:
In [4]:
X = dataset[["area", "rooms", "latitude", "longitude"]]
y = dataset["value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
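Note that the split is random, so the scores below change from run to run. If reproducibility matters, a variant with a pinned seed (not what was used for the outputs shown here) would be:

# Sketch: fix the split with an arbitrary seed so results are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)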
Train a Scikit-Learn pipeline, including a normalization step and a regression model:
In [5]:
pipeline = Pipeline(
[
("scaler", StandardScaler()),
("regressor", LinearRegression()),
]
)
pipeline.fit(X_train, y_train)
Out[5]:
Pipeline(steps=[('scaler', StandardScaler()), ('regressor', LinearRegression())])
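Since the final step is a plain linear regression, it can be instructive (though not needed for the rest of the notebook) to look at the learned coefficients; after standardization they are expressed in euros per standard deviation of each feature:

# Sketch: map each input feature to its learned coefficient
dict(zip(X.columns, pipeline.named_steps["regressor"].coef_))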
Score the model (RMSE) on the test set:
In [6]:
root_mean_squared_error(y_test, pipeline.predict(X_test))
Out[6]:
290714.8123380589
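That RMSE is large compared with the sale values shown earlier, which suggests a few very large transactions likely dominate the error. One quick check (not in the original notebook) is to compare against a baseline that always predicts the mean training value; the linear pipeline should at least beat it:

from sklearn.dummy import DummyRegressor

# Sketch: baseline regressor that always predicts the mean of y_train
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
root_mean_squared_error(y_test, baseline.predict(X_test))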
Try to predict the value of an apartment (50 m², 3 rooms, near Place Victor Hugo in Grenoble):
In [7]:
pipeline.predict([[50, 3, 45.1893525, 5.7216074]])
/home/runner/.local/share/virtualenvs/2024-02-23-ml-models-web-GaDE5OIw/lib/python3.11/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but StandardScaler was fitted with feature names
  warnings.warn(
Out[7]:
array([264748.35412483])
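The UserWarning appears because the pipeline was fitted on a DataFrame with named columns but receives a bare list here. Passing a one-row DataFrame with the same column names gives the same prediction without the warning:

# Sketch: predict from a DataFrame so feature names match the fitted pipeline
sample = pd.DataFrame(
    [[50.0, 3.0, 45.1893525, 5.7216074]],
    columns=["area", "rooms", "latitude", "longitude"],
)
pipeline.predict(sample)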
Export the model to ONNX format using skl2onnx:
In [8]:
onnx_model = skl2onnx.to_onnx(pipeline, X_train[:1].astype(np.float32))
onnx_model_path = Path() / "model.onnx"
onnx_model_path.write_bytes(onnx_model.SerializeToString())
Out[8]:
580
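Because to_onnx received a DataFrame, skl2onnx creates one named graph input per column, which is why the inference call below feeds area, rooms, latitude, and longitude separately. A quick way to confirm the expected inputs, assuming the onnx package is installed:

import onnx

# Sketch: list the graph's input names (expected: area, rooms, latitude, longitude)
onnx_graph = onnx.load(onnx_model_path)
[i.name for i in onnx_graph.graph.input]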
Load the ONNX model and run an inference on the sample data:
In [9]:
session = ort.InferenceSession(onnx_model_path, providers=ort.get_available_providers())
session.run(
None,
{
"area": [[50.0]],
"rooms": [[3.0]],
"latitude": [[45.1893525]],
"longitude": [[5.7216074]],
},
)
Out[9]:
[array([[264750.3]], dtype=float32)]
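The ONNX result (≈264750.3) differs slightly from the scikit-learn one (≈264748.35) because the exported graph computes in float32. A minimal sanity check, comparing the two predictions with a loose tolerance:

# Sketch: check that the ONNX runtime and scikit-learn agree within float32 precision
sklearn_pred = pipeline.predict(
    pd.DataFrame(
        [[50.0, 3.0, 45.1893525, 5.7216074]],
        columns=["area", "rooms", "latitude", "longitude"],
    )
)
onnx_pred = session.run(
    None,
    {
        "area": [[50.0]],
        "rooms": [[3.0]],
        "latitude": [[45.1893525]],
        "longitude": [[5.7216074]],
    },
)[0]
np.testing.assert_allclose(onnx_pred.ravel(), sklearn_pred, rtol=1e-4)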