Tabular data preprocessing¶

Overview¶

This package contains TabularProc, the base class for defining a preprocessing transformation on dataframes of tabular data, as well as the basic TabularProc subclasses. Preprocessing includes things like

replacing non-numerical variables by categories, then their ids,
filling missing values,
normalizing continuous variables.

In all those steps we have to be careful to apply to our validation or test set the same correspondence we decided on with our training set: which id we give to each category, which value we use to fill missing data, and which mean and standard deviation we use to normalize. To deal with this, we use a special class called TabularProc.

The data used in this documentation page is a subset of the adult dataset. It gives a certain amount of data on individuals to train a model to predict whether their salary is greater than $50k or not.

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
train_df, valid_df = df.iloc[:800].copy(), df.iloc[800:1000].copy()
train_df.head()

We see it contains numerical variables (like age or education-num) as well as categorical ones (like workclass or relationship). The original dataset is clean, but we removed a few values to give examples of dealing with missing values.

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
cont_names = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']

Transforms for tabular data¶

TabularProc is the base class for creating transforms for dataframes with categorical variables cat_names and continuous variables cont_names. Note that any column not in one of those lists won't be touched.
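The split between a training pass and a test pass can be sketched in plain pandas. This is a minimal illustration, not the fastai source: the class names SketchProc and ClipToTrainRange are hypothetical, and only the calling pattern (tfm(df), tfm(df, test=True), apply_train, apply_test) mirrors the calls used on this page.

```python
import pandas as pd


class SketchProc:
    """Hypothetical stand-in for a TabularProc-style base class (not the fastai source)."""

    def __init__(self, cat_names, cont_names):
        self.cat_names, self.cont_names = cat_names, cont_names

    def __call__(self, df, test=False):
        # Dispatch: learn state on the training set, reuse it on validation/test sets.
        (self.apply_test if test else self.apply_train)(df)

    def apply_train(self, df):
        raise NotImplementedError

    def apply_test(self, df):
        # Default assumed for this sketch: treat the test set like the training set.
        self.apply_train(df)


class ClipToTrainRange(SketchProc):
    """Hypothetical example: clip continuous columns to the range seen in training."""

    def apply_train(self, df):
        # Record the bounds on the training set only.
        self.bounds = {n: (df[n].min(), df[n].max()) for n in self.cont_names}

    def apply_test(self, df):
        # Apply the recorded bounds in place.
        for n, (lo, hi) in self.bounds.items():
            df[n] = df[n].clip(lo, hi)


train = pd.DataFrame({'age': [20, 35, 50]})
valid = pd.DataFrame({'age': [18, 40, 70]})

proc = ClipToTrainRange(cat_names=[], cont_names=['age'])
proc(train)              # records min/max from the training set
proc(valid, test=True)   # reuses those bounds on the validation set
clipped = valid['age'].tolist()
```

The point of the design is that all state (here the bounds) is computed once in apply_train and only read in apply_test, so the validation data never influences the preprocessing.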

The following TabularProc subclasses are implemented in the fastai library. Note that the replacement from categories to codes as well as the normalization of continuous variables are automatically done in a TabularDataBunch.

Categorify turns the variables in cat_names into pandas categorical columns. Variables in cont_names aren't affected.

tfm = Categorify(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)

Since we haven't replaced the categories with their codes, nothing visible has changed in the dataframe yet, but we can check that the variables are now categorical and view their corresponding codes.

train_df['workclass'].cat.categories

Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',
       ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
      dtype='object')

The validation set (processed with test=True) is given the same categories, and therefore the same codes, as the training set.

valid_df['workclass'].cat.categories

Index([' ?', ' Federal-gov', ' Local-gov', ' Private', ' Self-emp-inc',
       ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
      dtype='object')
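Why the categories must come from the training set can be seen with a small pandas-only sketch. It uses toy data and plain pandas calls rather than Categorify itself, so treat it as an illustration of the idea and not as the library's implementation.

```python
import pandas as pd

train = pd.DataFrame({'workclass': [' Private', ' State-gov', ' Local-gov']})
valid = pd.DataFrame({'workclass': [' State-gov', ' Private']})

# Learn the categories on the training set only.
train['workclass'] = train['workclass'].astype('category').cat.as_ordered()
train_cats = train['workclass'].cat.categories

# Naive approach: categorize the validation set on its own.
# Its codes no longer agree with the training codes.
naive_codes = valid['workclass'].astype('category').cat.codes.tolist()

# Reuse the training categories so that each value keeps its training code.
valid['workclass'] = pd.Categorical(valid['workclass'], categories=train_cats, ordered=True)
valid_codes = valid['workclass'].cat.codes.tolist()
```

With the naive approach ' Private' gets code 0 in the validation set but code 1 in the training set; reusing train_cats keeps the two consistent.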

FillMissing fills the missing values in the cont_names columns; cat_names variables are left untouched (their missing values will be replaced by code 0 in the TabularDataBunch). The fill_strategy is used to replace the nans in the continuous columns, and if add_col is True, whenever a column c has missing values, a column named c_na is added that flags the rows where the value was missing.

On the validation or test set, the missing values in the cont_names columns are filled with the values picked on the training set.

train_df[cont_names].head()

tfm = FillMissing(cat_names, cont_names)
tfm(train_df)
tfm(valid_df, test=True)
train_df[cont_names].head()

Values missing in the education-num column are replaced by 10, which is the median of the column in train_df. Categorical variables are not changed, since nan is simply used as another category.

valid_df[cont_names].head()
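The median strategy described above can be sketched with plain pandas on toy data. This is an assumption-laden illustration, not the FillMissing source: the flag-column name built here is just this sketch's choice, and the toy values are not taken from the adult dataset.

```python
import pandas as pd

train = pd.DataFrame({'education-num': [12.0, 14.0, None, 15.0, None]})
valid = pd.DataFrame({'education-num': [None, 9.0]})

col = 'education-num'
flag = col + '_na'   # hypothetical flag-column name for this sketch

# Pick the filler on the training set only (median strategy).
filler = train[col].median()

for df in (train, valid):
    df[flag] = df[col].isna()          # remember which rows were missing
    df[col] = df[col].fillna(filler)   # reuse the training median everywhere
```

The validation set never contributes to the filler: its missing value is replaced by the training median even though its own median would be different.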

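Normalize normalizes the continuous variables, using the mean and standard deviation computed on the training set for the validation set as well.

norm = Normalize(cat_names, cont_names)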

norm.apply_train(train_df)
train_df[cont_names].head()

norm.apply_test(valid_df)
valid_df[cont_names].head()
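The same train-then-reuse pattern applies to normalization. The sketch below uses plain pandas on toy data; the small constant eps is an assumption added to guard against a zero standard deviation, and nothing here is taken from the Normalize source.

```python
import pandas as pd

train = pd.DataFrame({'age': [20.0, 30.0, 40.0]})
valid = pd.DataFrame({'age': [30.0, 50.0]})

eps = 1e-7  # assumed small constant to avoid dividing by zero

# Statistics come from the training set only.
mean, std = train['age'].mean(), train['age'].std()

# The same statistics are applied to both sets.
train['age'] = (train['age'] - mean) / (std + eps)
valid['age'] = (valid['age'] - mean) / (std + eps)
```

After this step the training column has mean 0 and standard deviation close to 1, while the validation column generally does not, which is expected since it was scaled with the training statistics.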

Treating date columns¶

add_datepart expands a date column into its component parts. It will drop the original column from df if the drop flag is True. The time flag decides if we go down to the time parts or stick to the date parts.

df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017'], 'col2': ['a', 'b', 'a']})
add_datepart(df, 'col1') # inplace
df.head()
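To see where such date-part columns come from, here is a sketch that derives a few of them with the pandas .dt accessor. It is an illustration on the same three dates, not the add_datepart implementation, and the names parts and elapsed are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017']})
dates = pd.to_datetime(df['col1'], format='%m/%d/%Y')

# A few date parts read straight off the datetime accessor.
parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Day': dates.dt.day,
    'Dayofweek': dates.dt.dayofweek,   # Monday is 0
    'Dayofyear': dates.dt.dayofyear,
    'Is_month_start': dates.dt.is_month_start,
})

# Seconds since the Unix epoch, as an integer column.
elapsed = (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
```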

add_cyclic_datepart encodes the date parts as cosine and sine pairs instead:

df = pd.DataFrame({'col1': ['02/03/2017', '02/04/2017', '02/05/2017'], 'col2': ['a', 'b', 'a']})
df = add_cyclic_datepart(df, 'col1') # returns a dataframe
df.head()
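A cyclic encoding places a periodic quantity on the unit circle, so that the last value of a cycle ends up next to the first. The sketch below is a generic version of that idea; the helper cyclic and the choice of periods are assumptions for illustration and are not taken from the fastai source.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(
    pd.Series(['02/03/2017', '02/04/2017', '02/05/2017']), format='%m/%d/%Y'
)


def cyclic(values, period):
    """Map values in [0, period) onto the unit circle as a (cos, sin) pair."""
    angle = 2 * np.pi * values / period
    return np.cos(angle), np.sin(angle)


# Day of week runs 0..6, month is shifted to run 0..11.
weekday_cos, weekday_sin = cyclic(dates.dt.dayofweek, 7)
month_cos, month_sin = cyclic(dates.dt.month - 1, 12)
```

With these periods, Sunday (6) and Monday (0) are neighbours on the circle, which a plain integer encoding would not capture.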

Splitting data into cat and cont¶

Parameters:

df: A pandas dataframe.
max_card: Maximum number of distinct values an integer column may have and still be treated as categorical.
dep_var: Name of the dependent variable, which is left out of both lists.

Return:

cont_names: A list of names of continuous variables.
cat_names: A list of names of categorical variables.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'], 'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})
df

cont_list, cat_list = cont_cat_split(df=df, max_card=20, dep_var='col4')
cont_list, cat_list

(['col3'], ['col1', 'col2'])
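The result above is consistent with a simple dtype-and-cardinality rule, sketched here. split_sketch is a hypothetical re-implementation written for illustration; the exact rule used by cont_cat_split may differ in details.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def split_sketch(df, max_card=20, dep_var=None):
    """Hypothetical splitting rule: floats and high-cardinality ints are continuous."""
    cont_names, cat_names = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the dependent variable goes in neither list
        col = df[name]
        high_card_int = is_integer_dtype(col) and col.nunique() > max_card
        if is_float_dtype(col) or high_card_int:
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names


df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'a'],
                   'col3': [0.5, 1.2, 7.5], 'col4': ['ab', 'o', 'o']})

result = split_sketch(df, max_card=20, dep_var='col4')
low_card_result = split_sketch(df, max_card=2, dep_var='col4')
```

Lowering max_card to 2 moves col1 (three distinct integers) from the categorical list to the continuous one, which shows how the threshold only affects integer columns in this sketch.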

Output of train_df.head():

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

Output of train_df[cont_names].head() before FillMissing:

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
0	49	101320	12.0	0	1902	40
1	44	236746	14.0	10520	0	45
2	38	96185	NaN	0	0	32
3	38	112847	15.0	0	0	40
4	42	82297	NaN	0	0	50

Output of train_df[cont_names].head() after FillMissing:

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
0	49	101320	12.0	0	1902	40
1	44	236746	14.0	10520	0	45
2	38	96185	10.0	0	0	32
3	38	112847	15.0	0	0	40
4	42	82297	10.0	0	0	50

Output of valid_df[cont_names].head() after FillMissing:

	age	fnlwgt	education-num	capital-gain	hours-per-week
800	45	96975	10.0	0	40
801	46	192779	10.0	15024	60
802	36	376455	10.0	0	38
803	25	50053	10.0	0	45
804	37	164526	10.0	0	40

Output of train_df[cont_names].head() after Normalize:

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
0	0.829039	-0.812589	0.981643	-0.136271	4.416656	-0.050230
1	0.443977	0.355532	2.078450	1.153121	-0.228760	0.361492
2	-0.018098	-0.856881	-0.115165	-0.136271	-0.228760	-0.708985
3	-0.018098	-0.713162	2.626854	-0.136271	-0.228760	-0.050230
4	0.289952	-0.976672	-0.115165	-0.136271	-0.228760	0.773213

Deprecated: This is v1 of fastai, which is not supported.


Output of valid_df[cont_names].head() after Normalize:

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
800	0.520989	-0.850066	-0.115165	-0.136271	-0.22876	-0.050230
801	0.598002	-0.023706	-0.115165	1.705157	-0.22876	1.596657
802	-0.172123	1.560596	-0.115165	-0.136271	-0.22876	-0.214919
803	-1.019260	-1.254793	-0.115165	-0.136271	-0.22876	0.361492
804	-0.095110	-0.267403	-0.115165	-0.136271	-0.22876	-0.050230

Output of df.head() after add_datepart(df, 'col1'):

	col2	col1Year	col1Month	col1Week	col1Day	col1Dayofweek	col1Dayofyear	col1Is_month_end	col1Is_month_start	col1Is_quarter_end	col1Is_quarter_start	col1Is_year_end	col1Is_year_start	col1Elapsed
0	a	2017	2	5	3	4	34	False	False	False	False	False	False	1486080000
1	b	2017	2	5	4	5	35	False	False	False	False	False	False	1486166400
2	a	2017	2	5	5	6	36	False	False	False	False	False	False	1486252800

Output of df.head() after add_cyclic_datepart(df, 'col1'):

	col2	col1weekday_cos	col1weekday_sin	col1day_month_cos	col1day_month_sin	col1month_year_cos	col1month_year_sin	col1day_year_cos	col1day_year_sin
0	a	-0.900969	-0.433884	0.900969	0.433884	0.866025	0.5	0.842942	0.538005
1	b	-0.222521	-0.974928	0.781831	0.623490	0.866025	0.5	0.833556	0.552435
2	a	0.623490	-0.781831	0.623490	0.781831	0.866025	0.5	0.823923	0.566702

The example dataframe passed to cont_cat_split:

	col1	col2	col3	col4
0	1	a	0.5	ab
1	2	b	1.2	o
2	3	a	7.5	o
