Sentiment analysis model training

Let's train a simple binary sentiment classifier with scikit-learn and convert the fitted pipeline to ONNX format.

In [1]:
from pathlib import Path

import nltk.corpus
import onnxruntime as ort
import pandas as pd
import skl2onnx
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

Download the NLTK movie reviews corpus and load it into a DataFrame:

In [2]:
nltk.download("movie_reviews")
dataset_classes = nltk.corpus.movie_reviews.categories()
dataset = pd.DataFrame(
    [
        {
            "text": nltk.corpus.movie_reviews.raw(fileid),
            "sentiment": fileid.split("/")[0],
        }
        for fileid in nltk.corpus.movie_reviews.fileids()
    ]
)
dataset
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/runner/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Out[2]:
      text                                               sentiment
0     plot : two teen couples go to a church party ,...  neg
1     the happy bastard's quick movie review \ndamn ...  neg
2     it is movies like these that make a jaded movi...  neg
3     " quest for camelot " is warner bros . ' firs...   neg
4     synopsis : a mentally unstable man undergoing ...  neg
...   ...                                                ...
1995  wow ! what a movie . \nit's everything a movie...  pos
1996  richard gere can be a commanding actor , but h...  pos
1997  glory--starring matthew broderick , denzel was...  pos
1998  steven spielberg's second epic film on world w...  pos
1999  truman ( " true-man " ) burbank is the perfect...  pos

2000 rows × 2 columns

Split the dataset into training and test sets:

In [3]:
X = dataset["text"]
y = dataset["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
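
As a rough illustration of what a 25% hold-out does, the sketch below shuffles row indices with a fixed seed and sets a quarter of them aside. `holdout_split` is a hypothetical helper written for this example, not scikit-learn's implementation. Note that the call above passes no `random_state`, so the split (and the score further down) can change between runs; `train_test_split` accepts `random_state` and `stratify` arguments if a reproducible, class-balanced split is wanted.

```python
import math
import random


def holdout_split(n_rows, test_size=0.25, seed=0):
    """Return (train_indices, test_indices) for a shuffled hold-out split.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed -> same split on every run
    n_test = math.ceil(n_rows * test_size)  # size of the held-out part
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(2000, test_size=0.25)
print(len(train_idx), len(test_idx))  # 1500 500
```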

Train a scikit-learn pipeline consisting of a TF-IDF vectorization / normalization step and a classification model:

In [4]:
pipeline = Pipeline(
    [
        ("tf-idf", TfidfVectorizer()),
        ("classifier", LogisticRegression()),
    ]
)
pipeline.fit(X_train, y_train)
Out[4]:
Pipeline(steps=[('tf-idf', TfidfVectorizer()),
                ('classifier', LogisticRegression())])
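
To make the first pipeline step more concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization); `tokenize` and `tfidf` are hypothetical names, and the real vectorizer returns a sparse matrix rather than lists.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def tfidf(corpus):
    """Return (vocabulary, rows): one L2-normalized TF-IDF vector per document."""
    docs = [Counter(tokenize(text)) for text in corpus]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocabulary
    }
    rows = []
    for doc in docs:
        weights = [doc[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocabulary, rows


# Toy corpus, not taken from the movie review dataset
vocabulary, rows = tfidf(["a good movie", "a bad movie", "good good fun"])
print(vocabulary)  # ['bad', 'fun', 'good', 'movie']
```

In the second toy document, "bad" gets a higher weight than "movie" because it occurs in fewer documents. The logistic regression step then learns one coefficient per vocabulary term from vectors like these.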

Score the model (accuracy) on the test set:

In [5]:
pipeline.score(X_test, y_test)
Out[5]:
0.848
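
For a classifier, `score` reports mean accuracy, that is, the fraction of test samples whose predicted label matches the true one. With the 500 held-out reviews here (25% of 2000), 0.848 corresponds to 424 correct predictions. A minimal sketch of the same computation, where `accuracy` is a hypothetical helper and the labels are toy values:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, not taken from the movie review dataset
y_true = ["pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "pos", "pos"]
print(accuracy(y_true, y_pred))  # 0.75
```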

Try to predict the sentiment of sample sentences:

In [6]:
pipeline.predict(["a nice and good take"])
Out[6]:
array(['pos'], dtype=object)
In [7]:
pipeline.predict(["it hurts so bad"])
Out[7]:
array(['neg'], dtype=object)

Export the model to ONNX format using skl2onnx:

In [8]:
onnx_options = {id(pipeline): {"zipmap": False, "output_class_labels": True}}
onnx_model = skl2onnx.to_onnx(pipeline, X_train[:1].values, options=onnx_options)
onnx_model_path = Path() / "model.onnx"
onnx_model_path.write_bytes(onnx_model.SerializeToString())
Out[8]:
1130676

Load the ONNX model and run inference on a sample sentence:

In [9]:
session = ort.InferenceSession(onnx_model_path, providers=ort.get_available_providers())
session.run(None, {"X": ["it hurts so bad"]})
Out[9]:
[array(['neg'], dtype=object),
 array([[0.76245755, 0.23754245]], dtype=float32),
 array(['neg', 'pos'], dtype=object)]
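
The three arrays returned above appear to be the predicted label, the per-class probabilities, and the class labels (the last one presumably requested through `output_class_labels`). A minimal sketch of how they relate, assuming the probability columns follow the order of the class labels array; `top_class` is a hypothetical helper and the values are copied from the output above:

```python
# Values copied from the Out[9] cell
labels = ["neg"]
probabilities = [[0.76245755, 0.23754245]]
classes = ["neg", "pos"]


def top_class(row, classes):
    """Return (label, probability) for the most likely class in one row."""
    best = max(range(len(row)), key=row.__getitem__)
    return classes[best], row[best]


for predicted, row in zip(labels, probabilities):
    label, confidence = top_class(row, classes)
    assert label == predicted  # the argmax agrees with the returned label
    print(f"{label} ({confidence:.1%})")  # neg (76.2%)
```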