class: center, middle

# Pandas & Scikit-learn: Secret Best Friends!

## Christian Hudon | JDA Labs

???

Assumptions...

---

class: center, middle

# I ♥ Scikit-learn

???

Why? Let's say you have some data...

---

.center[![Iris dataset. Classification problem with all-numerical inputs.](img/numerical_data.png)]

???

Iris dataset. All numerical.

---

class: middle

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# (Imports are omitted on the later slides, to keep them short.)
dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data,
                                                    dataset.target)
model = LogisticRegression()
model.fit(X_train, y_train)
```

???

Easy to learn, etc. Gets you very quickly to "I've trained a model on this data." But unless the problem is very easy, you will need some feature engineering...

---

class: center, middle

# I _especially_ ♥ scikit-learn's Pipelines!

---

class: middle

```python
dataset = load_iris()
*model = Pipeline([
*    # As many other pipeline steps here as you would like...
*    ('pca', PCA()),
*    ('rf', RandomForestClassifier())])
hp_dist = {'pca__n_components': randint(2, 5),
           'rf__max_depth': [3, None],
           'rf__max_features': [2, None],
           'rf__min_samples_split': randint(2, 5),
           'rf__min_samples_leaf': randint(1, 5),
           'rf__bootstrap': [True, False],
           'rf__criterion': ['gini', 'entropy']}
*cross_valid_split = KFold(n_splits=5, shuffle=True)
*model_hpsearch = RandomizedSearchCV(model, hp_dist,
*                                    cv=cross_valid_split)
model_hpsearch.fit(dataset.data, dataset.target)
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

.center[![Iris dataset. Classification problem with all-numerical inputs.](img/numerical_data.png)]

???

But sometimes your data looks less like a big grid of numbers...

---

.center[![Retail data, with date and categorical columns in addition to numerical ones.](img/complex_data.png)]

???

... and more like this.

---

class: center, middle

# I ♥ Pandas

???

Pandas really shines for exploring this kind of data... But let's say you've explored enough and are ready to train a first model.

---

```python
dataset = load_iris()
model = Pipeline([
    # As many other pipeline steps here as you would like...
    ('pca', PCA()),
    ('rf', RandomForestClassifier())])
hp_dist = {'pca__n_components': randint(2, 5),
           'rf__max_depth': [3, None],
           'rf__max_features': [2, None],
           'rf__min_samples_split': randint(2, 5),
           'rf__min_samples_leaf': randint(1, 5),
           'rf__bootstrap': [True, False],
           'rf__criterion': ['gini', 'entropy']}
cross_valid_split = KFold(n_splits=5, shuffle=True)
model_hpsearch = RandomizedSearchCV(model, hp_dist,
                                    cv=cross_valid_split)
model_hpsearch.fit(dataset.data, dataset.target)
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

```python
*dataset = load_complicated_retail_dataset()
*TARGET = 'total_unit_sales'
model = Pipeline([
    # Add / remove feature engineering steps here!
    ('pca', PCA()),
    ('rf', RandomForestRegressor())])
hp_dist = {'pca__n_components': randint(2, 5),
           'rf__max_depth': [3, None],
           'rf__max_features': [2, None],
           'rf__min_samples_split': randint(2, 5),
           'rf__min_samples_leaf': randint(1, 5),
           'rf__bootstrap': [True, False],
           'rf__criterion': ['mse', 'mae']}
cross_valid_split = KFold(n_splits=5, shuffle=True)
model_hpsearch = RandomizedSearchCV(model, hp_dist,
                                    cv=cross_valid_split)
*model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

???

Well, a Pandas dataframe doesn't behave like a numpy array...
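---

class: middle

To make the retail examples concrete: a minimal sketch of what the hypothetical `load_complicated_retail_dataset()` helper could return. The column names here are illustrative assumptions, not a real dataset.

```python
import pandas as pd

def load_complicated_retail_dataset():
    # Hypothetical data: a date column, a categorical column and
    # numerical columns, including the 'total_unit_sales' target.
    return pd.DataFrame({
        'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
        'store_type': ['outlet', 'flagship', 'outlet'],
        'unit_price': [9.99, 24.99, 4.99],
        'total_unit_sales': [12, 3, 41],
    })
```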
---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*model = RandomForestRegressor()
hp_dist = {'max_depth': [3, None],
           'max_features': [2, None],
           'min_samples_split': randint(2, 5),
           'min_samples_leaf': randint(1, 5),
           'bootstrap': [True, False],
           'criterion': ['mse', 'mae']}
cross_valid_split = KFold(n_splits=5, shuffle=True)
model_hpsearch = RandomizedSearchCV(model, hp_dist,
                                    cv=cross_valid_split)
model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
model = RandomForestRegressor()
*model.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
```

???

This *almost* works...

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*model_matrix = pd.get_dummies(dataset.drop(['date', TARGET], axis=1))
*target_values = dataset[TARGET].values
model = RandomForestRegressor()
model.fit(model_matrix, target_values)
```

???

Now *this* works. But not too surprising...

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*# ☝︎☝︎ Pandas above ☝︎☝︎
model_matrix = pd.get_dummies(dataset.drop(['date', TARGET], axis=1))
target_values = dataset[TARGET].values
*# ☟☟ Scikit-learn below ☟☟
model = RandomForestRegressor()
model.fit(model_matrix, target_values)
```

???

If you don't mix the two, and convert to a numpy array at the boundary... yes, things work.

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*dataset = dataset.assign(feature1=dataset.groupby(...).transform(...),
*                         feature2=...)
*# More pandas transformations here...
# ☝︎☝︎ Pandas above ☝︎☝︎
model_matrix = pd.get_dummies(dataset.drop(['date', TARGET], axis=1))
target_values = dataset[TARGET].values
# ☟☟ Scikit-learn below ☟☟
*model_matrix = PCA().fit_transform(model_matrix)
*# More scikit-learn, numpy-based transforms here...
model = RandomForestRegressor()
model.fit(model_matrix, target_values)
```

---

# Lost with the simple solution

With pandas & scikit-learn kept separate, you can't really use:

1. Splits
2. Pipelines

???

And then, because of that: the model selection machinery, etc.

---

class: center, middle

# Can we do better?

???

Can we mix pandas and scikit-learn together? Can they work together?

---

class: center, middle

# Yes!

## (It didn't use to be the case, though.)

---

class: center, middle

# Four Periods

1. No (< 0.15)
2. Hmm. (0.15-0.16.1)
3. Yes... (≥ 0.16.1)
4. YES! (≥ 0.18)

---

class: center, middle

# 1. Splits (and model selection)

---

# The doc

![Documentation of train_test_split()](img/splitter_doc.png)

---

# The code

```python
def indexable(*iterables):
    """Make arrays indexable for cross-validation. [...]"""
    result = []
    for X in iterables:
        if sp.issparse(X):
            result.append(X.tocsr())
*        elif hasattr(X, "__getitem__") or hasattr(X, "iloc"):
*            result.append(X)
        elif X is None:
            result.append(X)
        else:
            result.append(np.array(X))
    check_consistent_length(*result)
    return result
```

---

# The code (part 2)

```python
def safe_indexing(X, indices):
    """Return items or rows from X using indices. [...]"""
*    if hasattr(X, "iloc"):
        # Pandas Dataframes and Series
        # [Exception handling removed ...]
*        return X.iloc[indices]
    elif hasattr(X, "shape"):
        if hasattr(X, 'take'):  # [And other conditions removed!]
            # This is often substantially faster than X[indices]
            return X.take(indices, axis=0)
        else:
            return X[indices]
    else:
        return [X[idx] for idx in indices]
```
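---

# Splits on a DataFrame, in action

Thanks to `indexable` and `safe_indexing`, the model selection machinery keeps pandas objects as pandas objects. A small sketch, reusing the hypothetical retail loader from earlier:

```python
from sklearn.model_selection import train_test_split

dataset = load_complicated_retail_dataset()
train, test = train_test_split(dataset, test_size=0.25)
# Both halves are still DataFrames, with rows picked via .iloc[].
print(type(train), type(test))
```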
---

class: center, middle

# 2. Pipelines

---

# The doc

![Documentation of the fit() method of the Pipeline class](img/pipeline_fit_doc.png)

---

# The code

```python
class Pipeline(...):
    # ...
    def _transform(self, X):
        Xt = X
        for name, transform in self.steps:
            if transform is not None:
*                Xt = transform.transform(Xt)
        return Xt
```

---

# Pandas pipeline step

```python
from pandas import DataFrame
from sklearn.base import BaseEstimator, TransformerMixin

class SubtractMeanFromColumn(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        if not isinstance(X, DataFrame):
            raise TypeError
        self.mean_ = X[self.column_name].mean()
        return self

    def transform(self, X):
        if not isinstance(X, DataFrame):
            raise TypeError
        Xt = X.copy()
        Xt[self.column_name] -= self.mean_
        return Xt
```

???

Same as a normal transformer (learn in `fit`, apply in `transform`), but taking and returning pandas dataframes!

---

class: center, middle

# The Results

---

# Combined example

```python
dataset = load_complicated_retail_dataset()
*dataset = dataset.sort_values('date')
TARGET = 'total_unit_sales'
model = Pipeline([
    # Create time regressor columns
    # As many other pipeline steps here as you would like...
*    # TODO Add a pipeline step here that transforms the pandas
*    # DataFrame into a numpy array and works in a train / test setting.
    ('pca', PCA()),
    ('rf', RandomForestRegressor())])
hp_dist = {...}
*split = TimeSeriesSplit(n_splits=5)
model_hpsearch = RandomizedSearchCV(model, hp_dist, cv=split)
model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

# Converting into a numpy array

* dask-ml
* sklearn-pandas

---

# Example (finalized)

```python
dataset = load_complicated_retail_dataset()
dataset = dataset.sort_values('date')
TARGET = 'total_unit_sales'
model = Pipeline([
    # Create time regressor columns
    # As many other pipeline steps here as you would like...
*    ('cat', Categorizer()),
*    ('dummy_enc', DummyEncoder()),
    ('pca', PCA()),
    ('rf', RandomForestRegressor())])
hp_dist = {...}
split = TimeSeriesSplit(n_splits=5)
model_hpsearch = RandomizedSearchCV(model, hp_dist, cv=split)
model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

???

... which is much nicer and more powerful than when the two were kept separate.

---

class: center, middle

# Our experience

???

- We have complex, non-i.i.d. data.
- Mostly used pipelines in the beginning (a very nice way to structure things when you're trying many different preprocessings).
- Couldn't use the splitters before 0.18 (explain why), but maybe the 0.18+ version is powerful enough for us now.
- We wrote our own code to do the dummy encoding of pandas dataframes into a model matrix. Don't.

---

# Summary

* Use at least scikit-learn 0.16.1 (April 2015), preferably 0.18 (September 2016) or greater.
* For splits, anything pandas-like (i.e. with an `.iloc[]` indexer) will work.
* For pipelines... In theory, as long as each step can accept as input the output of the previous step, you're fine (whatever the types). In practice, you probably want to structure things like this:
  1. All your pandas pipeline steps at the beginning.
  2. A conversion into an all-numerical numpy array, using either:
     1. `[Categorizer(), DummyEncoder()]` (from dask-ml)
     2. `DataFrameMapper(...)` (from sklearn-pandas; see the sketch on the next slide)
  3. Pipeline steps, if any, that operate on all-numerical data (like PCA).
  4. The actual model.
* Writing a pandas-based transform is the same as writing a numpy one, except for the inputs and output.
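---

# `DataFrameMapper` sketch

The sklearn-pandas option for the conversion step of the summary. A minimal sketch; the column names are the hypothetical ones from the retail loader above:

```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, StandardScaler

to_numpy = DataFrameMapper([
    # A string selects a single column, passed to the transformer as 1-D.
    ('store_type', LabelBinarizer()),
    # A list of names selects a 2-D sub-frame.
    (['unit_price'], StandardScaler()),
])
# Usable as a pipeline step, e.g. ('to_numpy', to_numpy), before PCA.
```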
---

class: small-text

# Some relevant scikit-learn changelog entries

**0.18**:

* The new module `sklearn.model_selection` [...] introduces new possibilities such as nested cross-validation and better manipulation of parameter searches with Pandas.
* Added new cross-validation splitter `model_selection.TimeSeriesSplit` to handle time series data.
* The cross-validation iterators are replaced by cross-validation splitters available from `sklearn.model_selection`, allowing for nested cross-validation.

**0.16.1**: Fix a regression where `utils.shuffle` converted lists and dataframes to arrays.

**0.16**: `GridSearchCV` and `cross_val_score` and other meta-estimators don't convert pandas DataFrames into arrays any more, allowing DataFrame-specific operations in custom estimators.

**0.15.1**: Fixed data input checks of most estimators to accept input data that implements the NumPy `__array__` protocol. This is the case for pandas.Series and pandas.DataFrame in recent versions of pandas.

???

If you want to dig deeper.

---

class: small-text

# Some relevant scikit-learn commits

* https://github.com/scikit-learn/scikit-learn/commit/b080d93b5ea15111ee1a97f142e020dee661dbb6
  By amueller. July 18th, 2014 (released in 0.16). Don't force other data formats into numpy arrays. Add tests to make sure that `cross_val_score` and `train_test_split` keep working with pandas DataFrames. Add `.iloc[]` support to the `safe_indexing` helper function.
* https://github.com/scikit-learn/scikit-learn/commit/6e2a83b4e184f0e51aead75d5c82fc0284fa6233
  By amueller. July 20th, 2014 (released in 0.16). Introduce the `indexable` helper function (with support for the `.iloc[]` indexer), used to check that a data object can be used for cross-validation, etc. Big refactoring of `check_array` (used by estimators).
* https://github.com/scikit-learn/scikit-learn/commit/664d78eb7cd47e477391f943c53621e1196325a3
  By raghavrv. July 2nd, 2015 (released in 0.17). Removed support for the deprecated sequence-of-sequences data format. These needed to be systematically converted to another, more efficient format at the beginning of scikit-learn methods and functions.

???

For digging deeper too.

---

# Related projects

* [**dask/dask-ml**](https://dask-ml.readthedocs.io/en/latest/): `Categorizer` and `DummyEncoder` [(direct doc link)](https://dask-ml.readthedocs.io/en/latest/preprocessing.html#additional-tranformers) can be used to convert a pandas DataFrame to a numpy array.
* [**scikit-learn-contrib/categorical-encoding**](http://contrib.scikit-learn.org/categorical-encoding/): "A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques."
* [**scikit-learn-contrib/sklearn-pandas**](https://github.com/scikit-learn-contrib/sklearn-pandas): Started before the changes that made scikit-learn more friendly towards pandas... Includes a very flexible `DataFrameMapper` class.
* Still missing (?): a library of common pandas transforms. Although `sklearn.preprocessing.FunctionTransformer` (new in 0.17), with `validate=False`, helps for the basics (see the bonus sketch at the end of the deck).

---

# If you're from the future...

Maybe things will be even better when you see this!

* https://github.com/scikit-learn/scikit-learn/issues/10165
  Issue to improve the scikit-learn documentation, to showcase more how well it plays with pandas.

---

class: center, middle

# Thank you!

## http://christianhudon.name/talks/#PyConCA-2017
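---

class: small-text

# Bonus: `FunctionTransformer` for simple pandas steps

For stateless transforms, `sklearn.preprocessing.FunctionTransformer` (with `validate=False`) saves writing a full transformer class. A minimal sketch; the `add_weekday` function and the `date` column are assumptions based on the retail example:

```python
from sklearn.preprocessing import FunctionTransformer

def add_weekday(df):
    # Stateless feature: derive the day of the week from 'date'.
    return df.assign(weekday=df['date'].dt.weekday)

# validate=False stops scikit-learn from coercing the DataFrame
# into a numpy array before calling our function.
weekday_step = FunctionTransformer(add_weekday, validate=False)
```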