class: center, middle

# Pandas & Scikit-learn: Secret Best Friends!

## Christian Hudon | JDA Labs

???

Assumptions...

---

class: center, middle

# I ♥ Scikit-learn

???

Why? Let's say you have some data...

---

.center[![Iris dataset. Classification problem with all-numerical inputs.](img/numerical_data.png)]

???

Iris dataset. All numerical.

---

class: middle

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# (Imports are omitted on the later slides, to keep them short.)
dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(dataset.data,
                                                    dataset.target)
model = LogisticRegression()
model.fit(X_train, y_train)
```

???

Easy to learn, etc. Gets you very quickly to "I've trained a model on this data." But unless the problem is very easy, you will need some feature engineering...

---

class: center, middle

# I _especially_ ♥ scikit-learn's Pipelines!

---

class: middle

```python
dataset = load_iris()
*model = Pipeline([
*    # As many other pipeline steps here as you would like...
*    ('pca', PCA()),
*    ('rf', RandomForestClassifier())])
hp_dist = {'pca__n_components': randint(2, 5),
           'rf__max_depth': [3, None],
           'rf__max_features': [2, None],
           'rf__min_samples_split': randint(2, 5),
           'rf__min_samples_leaf': randint(1, 5),
           'rf__bootstrap': [True, False],
           'rf__criterion': ['gini', 'entropy']}
*cross_valid_split = KFold(n_splits=5, shuffle=True)
*model_hpsearch = RandomizedSearchCV(model, hp_dist,
*                                    cv=cross_valid_split)
model_hpsearch.fit(dataset.data, dataset.target)
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

.center[![Iris dataset. Classification problem with all-numerical inputs.](img/numerical_data.png)]

???

But sometimes your data looks less like a big grid of numbers...

---

.center[![Retail data, with date and categorical columns in addition to numerical ones.](img/complex_data.png)]

???

... and more like this.

---

class: center, middle

# I ♥ Pandas

???

Pandas really shines for exploring this kind of data... But let's say you've explored enough and are ready to train a first model.

---

```python
dataset = load_iris()
model = Pipeline([
    # As many other pipeline steps here as you would like...
    ('pca', PCA()),
    ('rf', RandomForestClassifier())])
hp_dist = {'pca__n_components': randint(2, 5),
           'rf__max_depth': [3, None],
           'rf__max_features': [2, None],
           'rf__min_samples_split': randint(2, 5),
           'rf__min_samples_leaf': randint(1, 5),
           'rf__bootstrap': [True, False],
           'rf__criterion': ['gini', 'entropy']}
cross_valid_split = KFold(n_splits=5, shuffle=True)
model_hpsearch = RandomizedSearchCV(model, hp_dist,
                                    cv=cross_valid_split)
model_hpsearch.fit(dataset.data, dataset.target)
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

```python
*dataset = load_complicated_retail_dataset()
*TARGET = 'total_unit_sales'
model = Pipeline([
    # Add / remove feature engineering steps here!
    ('pca', PCA()),
    ('rf', RandomForestRegressor())])
hp_dist = {'pca__n_components': randint(2, 5),
           'rf__max_depth': [3, None],
           'rf__max_features': [2, None],
           'rf__min_samples_split': randint(2, 5),
           'rf__min_samples_leaf': randint(1, 5),
           'rf__bootstrap': [True, False],
           'rf__criterion': ['mse', 'mae']}
cross_valid_split = KFold(n_splits=5, shuffle=True)
model_hpsearch = RandomizedSearchCV(model, hp_dist,
                                    cv=cross_valid_split)
*model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

???

Well, a Pandas dataframe doesn't behave like a numpy array...
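---

class: middle

To make the retail examples concrete: a minimal sketch of what the hypothetical `load_complicated_retail_dataset()` helper could return. The column names here are illustrative assumptions, not a real dataset.

```python
import pandas as pd

def load_complicated_retail_dataset():
    # Hypothetical data: a date column, a categorical column and
    # numerical columns, including the 'total_unit_sales' target.
    return pd.DataFrame({
        'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03']),
        'store_type': ['outlet', 'flagship', 'outlet'],
        'unit_price': [9.99, 24.99, 4.99],
        'total_unit_sales': [12, 3, 41],
    })
```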
---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*model = RandomForestRegressor()
hp_dist = {'max_depth': [3, None],
           'max_features': [2, None],
           'min_samples_split': randint(2, 5),
           'min_samples_leaf': randint(1, 5),
           'bootstrap': [True, False],
           'criterion': ['mse', 'mae']}
cross_valid_split = KFold(n_splits=5, shuffle=True)
model_hpsearch = RandomizedSearchCV(model, hp_dist,
                                    cv=cross_valid_split)
model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
model = RandomForestRegressor()
*model.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
```

???

This *almost* works...

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*model_matrix = pd.get_dummies(dataset.drop(['date', TARGET], axis=1))
*target_values = dataset[TARGET].values
model = RandomForestRegressor()
model.fit(model_matrix, target_values)
```

???

Now *this* works. But not too surprising...

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*# ☝︎☝︎ Pandas above ☝︎☝︎
model_matrix = pd.get_dummies(dataset.drop(['date', TARGET], axis=1))
target_values = dataset[TARGET].values
*# ☟☟ Scikit-learn below ☟☟
model = RandomForestRegressor()
model.fit(model_matrix, target_values)
```

???

If you don't mix the two, and convert to a numpy array at the boundary... yes, things work.

---

```python
dataset = load_complicated_retail_dataset()
TARGET = 'total_unit_sales'
*dataset = dataset.assign(feature1=dataset.groupby(...).transform(...),
*                         feature2=...)
*# More pandas transformations here...
# ☝︎☝︎ Pandas above ☝︎☝︎
model_matrix = pd.get_dummies(dataset.drop(['date', TARGET], axis=1))
target_values = dataset[TARGET].values
# ☟☟ Scikit-learn below ☟☟
*model_matrix = PCA().fit_transform(model_matrix)
*# More scikit-learn, numpy-based transforms here...
model = RandomForestRegressor()
model.fit(model_matrix, target_values)
```

---

# Lost with the simple solution

With pandas & scikit-learn kept separate, you can't really use:

1. Splits
2. Pipelines

???

And then, because of that: the model selection machinery, etc.

---

class: center, middle

# Can we do better?

???

Can we mix pandas and scikit-learn together? Can they work together?

---

class: center, middle

# Yes!

## (It didn't use to be the case, though.)

---

class: center, middle

# Four Periods

1. No (< 0.15)
2. Hmm. (0.15-0.16.1)
3. Yes... (≥ 0.16.1)
4. YES! (≥ 0.18)

---

class: center, middle

# 1. Splits (and model selection)

---

# The doc

![Documentation of train_test_split()](img/splitter_doc.png)

---

# The code

```python
def indexable(*iterables):
    """Make arrays indexable for cross-validation. [...]"""
    result = []
    for X in iterables:
        if sp.issparse(X):
            result.append(X.tocsr())
*        elif hasattr(X, "__getitem__") or hasattr(X, "iloc"):
*            result.append(X)
        elif X is None:
            result.append(X)
        else:
            result.append(np.array(X))
    check_consistent_length(*result)
    return result
```

---

# The code (part 2)

```python
def safe_indexing(X, indices):
    """Return items or rows from X using indices. [...]"""
*    if hasattr(X, "iloc"):
        # Pandas Dataframes and Series
        # [Exception handling removed ...]
*        return X.iloc[indices]
    elif hasattr(X, "shape"):
        if hasattr(X, 'take'):  # [And other conditions removed!]
            # This is often substantially faster than X[indices]
            return X.take(indices, axis=0)
        else:
            return X[indices]
    else:
        return [X[idx] for idx in indices]
```
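---

# Splits on a DataFrame, in action

Thanks to `indexable` and `safe_indexing`, the model selection machinery keeps pandas objects as pandas objects. A small sketch, reusing the hypothetical retail loader from earlier:

```python
from sklearn.model_selection import train_test_split

dataset = load_complicated_retail_dataset()
train, test = train_test_split(dataset, test_size=0.25)
# Both halves are still DataFrames, with rows picked via .iloc[].
print(type(train), type(test))
```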
---

class: center, middle

# 2. Pipelines

---

# The doc

![Documentation of the fit() method of the Pipeline class](img/pipeline_fit_doc.png)

---

# The code

```python
class Pipeline(...):
    # ...
    def _transform(self, X):
        Xt = X
        for name, transform in self.steps:
            if transform is not None:
*                Xt = transform.transform(Xt)
        return Xt
```

---

# Pandas pipeline step

```python
from pandas import DataFrame
from sklearn.base import BaseEstimator, TransformerMixin

class SubtractMeanFromColumn(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        if not isinstance(X, DataFrame):
            raise TypeError
        self.mean_ = X[self.column_name].mean()
        return self

    def transform(self, X):
        if not isinstance(X, DataFrame):
            raise TypeError
        Xt = X.copy()
        Xt[self.column_name] -= self.mean_
        return Xt
```

???

Same as a normal transformer (learn in `fit`, apply in `transform`), but taking and returning pandas dataframes!

---

class: center, middle

# The Results

---

# Combined example

```python
dataset = load_complicated_retail_dataset()
*dataset = dataset.sort_values('date')
TARGET = 'total_unit_sales'
model = Pipeline([
    # Create time regressor columns
    # As many other pipeline steps here as you would like...
*    # TODO Add a pipeline step here that transforms the pandas
*    # DataFrame into a numpy array and works in a train / test setting.
    ('pca', PCA()),
    ('rf', RandomForestRegressor())])
hp_dist = {...}
*split = TimeSeriesSplit(n_splits=5)
model_hpsearch = RandomizedSearchCV(model, hp_dist, cv=split)
model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

---

# Converting into a numpy array

* dask-ml
* sklearn-pandas

---

# Example (finalized)

```python
dataset = load_complicated_retail_dataset()
dataset = dataset.sort_values('date')
TARGET = 'total_unit_sales'
model = Pipeline([
    # Create time regressor columns
    # As many other pipeline steps here as you would like...
*    ('cat', Categorizer()),
*    ('dummy_enc', DummyEncoder()),
    ('pca', PCA()),
    ('rf', RandomForestRegressor())])
hp_dist = {...}
split = TimeSeriesSplit(n_splits=5)
model_hpsearch = RandomizedSearchCV(model, hp_dist, cv=split)
model_hpsearch.fit(dataset.drop(TARGET, axis=1), dataset[TARGET])
print('Best out-of-sample score:', model_hpsearch.best_score_)
```

???

... which is much nicer and more powerful than when the two were kept separate.

---

class: center, middle

# Our experience

???

- We have complex, non-i.i.d. data.
- Mostly used pipelines in the beginning (a very nice way to structure things when you're trying many different preprocessings).
- Couldn't use the splitters before 0.18 (explain why), but maybe the 0.18+ version is powerful enough for us now.
- We wrote our own code to do the dummy encoding of pandas dataframes into a model matrix. Don't.

---

# Summary

* Use at least scikit-learn 0.16.1 (April 2015), preferably 0.18 (September 2016) or greater.
* For splits, anything pandas-like (i.e. with an `.iloc[]` indexer) will work.
* For pipelines... In theory, as long as each step can accept as input the output of the previous step, you're fine (whatever the types). In practice, you probably want to structure things like this:
  1. All your pandas pipeline steps at the beginning.
  2. A conversion into an all-numerical numpy array, using either:
     1. `[Categorizer(), DummyEncoder()]` (from dask-ml)
     2. `DataFrameMapper(...)` (from sklearn-pandas; see the sketch on the next slide)
  3. Pipeline steps, if any, that operate on all-numerical data (like PCA).
  4. The actual model.
* Writing a pandas-based transform is the same as writing a numpy one, except for the inputs and output.
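---

# `DataFrameMapper` sketch

The sklearn-pandas option for the conversion step of the summary. A minimal sketch; the column names are the hypothetical ones from the retail loader above:

```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, StandardScaler

to_numpy = DataFrameMapper([
    # A string selects a single column, passed to the transformer as 1-D.
    ('store_type', LabelBinarizer()),
    # A list of names selects a 2-D sub-frame.
    (['unit_price'], StandardScaler()),
])
# Usable as a pipeline step, e.g. ('to_numpy', to_numpy), before PCA.
```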
---

class: small-text

# Some relevant scikit-learn changelog entries

**0.18**:

* The new module `sklearn.model_selection` [...] introduces new possibilities such as nested cross-validation and better manipulation of parameter searches with Pandas.
* Added new cross-validation splitter `model_selection.TimeSeriesSplit` to handle time series data.
* The cross-validation iterators are replaced by cross-validation splitters available from `sklearn.model_selection`, allowing for nested cross-validation.

**0.16.1**: Fix a regression where `utils.shuffle` converted lists and dataframes to arrays.

**0.16**: `GridSearchCV` and `cross_val_score` and other meta-estimators don't convert pandas DataFrames into arrays any more, allowing DataFrame-specific operations in custom estimators.

**0.15.1**: Fixed data input checks of most estimators to accept input data that implements the NumPy `__array__` protocol. This is the case for pandas.Series and pandas.DataFrame in recent versions of pandas.

???

If you want to dig deeper.

---

class: small-text

# Some relevant scikit-learn commits

* https://github.com/scikit-learn/scikit-learn/commit/b080d93b5ea15111ee1a97f142e020dee661dbb6
  By amueller. July 18th, 2014 (released in 0.16). Don't force other data formats into numpy arrays. Add tests to make sure that `cross_val_score` and `train_test_split` keep working with pandas DataFrames. Add `.iloc[]` support to the `safe_indexing` helper function.
* https://github.com/scikit-learn/scikit-learn/commit/6e2a83b4e184f0e51aead75d5c82fc0284fa6233
  By amueller. July 20th, 2014 (released in 0.16). Introduce the `indexable` helper function (with support for the `.iloc[]` indexer), used to check that a data object can be used for cross-validation, etc. Big refactoring of `check_array` (used by estimators).
* https://github.com/scikit-learn/scikit-learn/commit/664d78eb7cd47e477391f943c53621e1196325a3
  By raghavrv. July 2nd, 2015 (released in 0.17). Removed support for the deprecated sequence-of-sequences data format. These needed to be systematically converted to another, more efficient format at the beginning of scikit-learn methods and functions.

???

For digging deeper too.

---

# Related projects

* [**dask/dask-ml**](https://dask-ml.readthedocs.io/en/latest/): `Categorizer` and `DummyEncoder` [(direct doc link)](https://dask-ml.readthedocs.io/en/latest/preprocessing.html#additional-tranformers) can be used to convert a pandas DataFrame to a numpy array.
* [**scikit-learn-contrib/categorical-encoding**](http://contrib.scikit-learn.org/categorical-encoding/): "A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques."
* [**scikit-learn-contrib/sklearn-pandas**](https://github.com/scikit-learn-contrib/sklearn-pandas): Started before the changes that made scikit-learn more friendly towards pandas... Includes a very flexible `DataFrameMapper` class.
* Still missing (?): a library of common pandas transforms. Although `sklearn.preprocessing.FunctionTransformer` (new in 0.17), with `validate=False`, helps for the basics (see the bonus sketch at the end of the deck).

---

# If you're from the future...

Maybe things will be even better when you see this!

* https://github.com/scikit-learn/scikit-learn/issues/10165
  Issue to improve the scikit-learn documentation, to showcase more how well it plays with pandas.

---

class: center, middle

# Thank you!

## http://christianhudon.name/talks/#PyConCA-2017
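---

class: small-text

# Bonus: `FunctionTransformer` for simple pandas steps

For stateless transforms, `sklearn.preprocessing.FunctionTransformer` (with `validate=False`) saves writing a full transformer class. A minimal sketch; the `add_weekday` function and the `date` column are assumptions based on the retail example:

```python
from sklearn.preprocessing import FunctionTransformer

def add_weekday(df):
    # Stateless feature: derive the day of the week from 'date'.
    return df.assign(weekday=df['date'].dt.weekday)

# validate=False stops scikit-learn from coercing the DataFrame
# into a numpy array before calling our function.
weekday_step = FunctionTransformer(add_weekday, validate=False)
```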