Structured data modeling using TensorFlow

Ajay Krishnan
8 min read · Jan 23, 2022


Structured data is quantitative and is often displayed as numbers, dates, values, and strings; it is stored in rows and columns.
Unstructured data, by contrast, is qualitative and includes text, video, audio, images, and so on.

Let's use a small heart disease dataset. There are several hundred rows in the CSV; each row describes a patient, and each column describes an attribute. We will use this information to predict whether a patient has heart disease, which is a binary classification task.

Let's first get a feel for how to model structured data on the TensorFlow platform, then learn how to create models using the Keras Sequential, Functional, and Subclassing APIs.
Finally, let's finish by creating an end-to-end model.

Without wasting any time, let's jump right in.

# Loading Data
import tensorflow as tf
import pandas as pd

url = 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv'
csv_file = tf.keras.utils.get_file('heart.csv', url)
heart_df = pd.read_csv(csv_file)
heart_df.head(5)
heart disease data
# Feature and target
y = heart_df.pop('target')
X = heart_df

Modeling structured data

Let’s build models to predict the label contained in the `target` column.
We can build models using:

  • Sequential API
  • Functional API
  • Subclassing API

Models defined in the Sequential, Functional, or Subclassing style can be trained in two ways:

  • Built-in training loops
  • Custom training loops

You can use a built-in training routine and loss function via model.compile(...) and model.fit(...), or write a custom training loop if you need the added flexibility, for example if you'd like to write your own gradient-clipping code or a custom loss function.
For this tutorial, let's use the built-in training loop; a sketch of the custom alternative follows below.
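For reference, here is a minimal sketch of what a custom training step with gradient clipping might look like. The optimizer, loss function, and clipping threshold here are illustrative assumptions, not something this tutorial prescribes:

loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)       # forward pass
        loss = loss_fn(y, logits)
    # backward pass: compute, clip, then apply gradients
    grads = tape.gradient(loss, model.trainable_variables)
    grads = [tf.clip_by_norm(g, 1.0) for g in grads]  # illustrative threshold
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss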

When to use a Sequential model?

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

# Baseline model using Sequential API
numeric_feature_names = ['age', 'thalach', 'trestbps', 'chol', 'oldpeak']
features = X[numeric_feature_names]

# converting features into tensors
feat_tensor = tf.convert_to_tensor(features)

# baseline model
baseline = tf.keras.Sequential([
    tf.keras.layers.Dense(32, 'relu'),
    tf.keras.layers.Dense(1),
])
# compile
baseline.compile(optimizer='adam',
                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                 metrics=['accuracy'])
# training
baseline.fit(feat_tensor, y, epochs=5)
Sequential model training logs

Can we improve the performance of our Sequential model?

What can we do? Let's take a look at the statistics of the input data.

# if we look at the mean and std of the numeric features, they bounce around
features.describe().T.loc[:, ['mean', 'std']]

Use a Normalization layer as the first layer of our model.

# let's add a normalizing layer to the baseline model
normalize = tf.keras.layers.Normalization()
normalize.adapt(features)

baseline_normalized = tf.keras.Sequential([
    normalize,
    tf.keras.layers.Dense(32, 'relu'),
    tf.keras.layers.Dense(1),
])
# compile
baseline_normalized.compile(optimizer='adam',
                            loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                            metrics=['accuracy'])
# training
baseline_normalized.fit(feat_tensor, y, epochs=5)
Sequential model on normalized data

A Sequential model is not appropriate when:

  • Your model has multiple inputs or multiple outputs
  • Any of your layers has multiple inputs or multiple outputs
  • You need to do layer sharing
  • You want non-linear topology (e.g. a residual connection, a multi-branch model)

When you start dealing with heterogeneous data, it is no longer possible to treat the DataFrame as if it were a single array, because TensorFlow tensors require that all elements have the same dtype.

So, in this case, you need to start treating it as a dictionary of columns, where each column has a uniform dtype. A DataFrame is a lot like a dictionary of arrays, so typically all you need to do is cast the DataFrame to a Python dict.
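As a minimal sketch (the variable names here are just for illustration), the cast is a one-liner, and the resulting dict of columns also slots directly into tf.data:

features_dict = dict(X)   # {'age': <column>, 'sex': <column>, ...}
# each column can now become a tensor with its own dtype
tf.convert_to_tensor(features_dict['age'])
# a (features, labels) dataset built from the dict of columns
ds = tf.data.Dataset.from_tensor_slices((features_dict, y))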

Model Subclassing

Model subclassing is an imperative style in which you build a model by extending a class. Building models in this style feels like object-oriented Python development.

From a developer perspective, the way this works is you extend a Model class defined by the framework, instantiate your layers, then write the forward pass of your model imperatively (the backward pass is generated automatically).

# Utility function: stack a dict of columns into a single float32 tensor
def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
        values.append(tf.cast(inputs[key], tf.float32))

    return fun(values, axis=-1)
# The Model-subclass style

class MyBaselineModel(tf.keras.Model):
    def __init__(self):
        super(MyBaselineModel, self).__init__()
        self.normalize = tf.keras.layers.Normalization()
        self.dense = tf.keras.layers.Dense(32, 'relu')
        self.output_layer = tf.keras.layers.Dense(1)

    def adapt(self, inputs):
        inputs = stack_dict(inputs)
        self.normalize.adapt(inputs)

    def call(self, x):
        x = stack_dict(x)
        x = self.normalize(x)
        x = self.dense(x)
        x = self.output_layer(x)
        return x

Training the model

# initialize an instance of the class MyBaselineModel
my_base = MyBaselineModel()
# adapt the normalizer
my_base.adapt(dict(features))
# compile the model
my_base.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
# train the model
my_base.fit(dict(features), y, epochs=5)
Model training logs

Functional API

The Functional API is a way to create more flexible models. It can handle non-linear topology, models with shared layers, and models with multiple inputs or outputs.

The Functional API is higher-level, easier to use, and safer, and it has a number of features that subclassed models do not support.

Training, evaluation, and inference work exactly in the same way for models built using the functional API as for Sequential models.

It is also less verbose than model subclassing.

In the functional API, the input specification (shape and dtype) is created in advance (using tf.keras.layers.Input). Every time you call a layer, the layer checks that the specification passed to it matches its assumptions, and it will raise a helpful error message if not. This guarantees that any model you can build with the functional API will run.
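As a quick toy illustration (hypothetical shapes, not the heart dataset), the check fires the moment a layer is called on an incompatible symbolic input:

x_in = tf.keras.Input(shape=(4,), dtype=tf.float32)
h = tf.keras.layers.Dense(8)(x_in)       # OK: Dense accepts a (None, 4) input
# tf.keras.layers.Reshape((2, 3))(x_in)  # would raise: 4 elements can't fill a 2x3 shape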

# The Keras functional style

inputs = {}
for col in features.columns:
    inputs[col] = tf.keras.Input(shape=(1,), name=col, dtype=tf.float32)

inputs
Symbolic inputs
# let's concatenate the inputs
concated_inputs = stack_dict(inputs, fun=tf.concat)
concated_inputs
concatenation of symbolic inputs
# normalization
normalize = tf.keras.layers.Normalization()
normalize.adapt(stack_dict(dict(features)))

x = normalize(concated_inputs)
x = tf.keras.layers.Dense(32, 'relu')(x)
results = tf.keras.layers.Dense(1)(x)

my_base_model = tf.keras.Model(inputs, results)

my_base_model.compile(optimizer='adam',
                      loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                      metrics=['accuracy'])

tf.keras.utils.plot_model(my_base_model,
                          show_shapes=True,
                          rankdir='LR')
Model created using functional API
# Training
my_base_model.fit(dict(features), y, epochs=5)
Model training logs

End to End Example

If you're passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So, the best approach is to build the preprocessing into the model.

numeric_feature_names = ['age', 'thalach', 'trestbps', 'chol', 'oldpeak']
binary_feature_names = ['sex', 'fbs', 'exang']
categorical_feature_names = ['cp', 'restecg', 'slope', 'thal', 'ca']

End to end example using Functional API

# Symbolic inputs
inputs = {}

for col in X.columns:
    if X[col].dtype == object:       # string columns
        col_type = tf.string
    elif X[col].dtype == 'int64':    # integer columns
        col_type = tf.int64
    else:                            # floating-point columns
        col_type = tf.float32

    inputs[col] = tf.keras.Input(name=col, dtype=col_type, shape=(1,))

inputs
Symbolic Inputs

Binary inputs don't need any preprocessing; just cast them to float32 and add them to the list.

preprocessed = []

for binary_input in binary_feature_names:
    inp = inputs[binary_input]
    preprocessed.append(tf.cast(inp, tf.float32))

Numeric inputs need to be normalized.

for numerical in numeric_feature_names:
    numerical_input = inputs[numerical]
    normalizer = tf.keras.layers.Normalization(axis=None)
    normalizer.adapt(X[numerical].values)
    normalized = normalizer(numerical_input)
    preprocessed.append(normalized)

Categorical features first need to be encoded into either binary vectors or embeddings.

for cat in categorical_feature_names:
    cat_input = inputs[cat]
    cat_feature = X[cat]
    if type(cat_feature[0]) == str:
        lookup = tf.keras.layers.StringLookup()
    else:
        lookup = tf.keras.layers.IntegerLookup()
    lookup.adapt(cat_feature)
    encoder = tf.keras.layers.CategoryEncoding(
        num_tokens=lookup.vocabulary_size(),
        output_mode='one_hot')
    vocab = lookup(cat_input)
    encoding = encoder(vocab)
    preprocessed.append(encoding)
List of preprocessed inputs

Concatenate all the preprocessed features along the depth axis, so each dictionary-example is converted into a single vector.

preprocessed_concated = tf.keras.layers.concatenate(preprocessed)
preprocessed_concated
Concatenated output
# Now create a model out of that calculation so it can be reused:
preprocessor = tf.keras.Model(inputs, preprocessed_concated)

tf.keras.utils.plot_model(preprocessor,
                          rankdir='LR',
                          show_shapes=True)
Preprocessing head of our model

Let’s see preprocessing in action

# preprocessing example
preprocessor(dict(X.iloc[1:2]))
Preprocessed output

Let's build the model body and train it. For simplicity, let's create the body using the Sequential API. All models in the Keras API can interact with each other, whether they're Sequential models, Functional models, or subclassed models.

# let's build the body of the model
body = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1)
])

Let’s connect the pieces together.
Building a model with Keras can feel as easy as “plugging LEGO bricks together.” We can mix and match API styles.

# Now put the two pieces together, Keras functional style.
prep = preprocessor(inputs)
results = body(prep)
my_model = tf.keras.Model(inputs, results)
# compiling
my_model.compile(optimizer='adam',
                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                 metrics=['accuracy'])
# training
my_model.fit(dict(X), y, epochs=5)
Final model training logs
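Because the preprocessing now lives inside the model, inference takes the same raw dictionary of columns as training. A quick sanity-check sketch (not from the original post):

sample = dict(X.iloc[:1])     # one raw, un-preprocessed row
logit = my_model(sample)
prob = tf.nn.sigmoid(logit)   # the model outputs logits (from_logits=True)
print(prob.numpy())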

Let's wrap up this tutorial by building the same model with the Subclassing API, just to illustrate the flexibility TensorFlow offers.

End to End Example: Subclassing API

class SubclassModel(tf.keras.Model):
    def __init__(self):
        super(SubclassModel, self).__init__()
        self.d1 = tf.keras.layers.Dense(32, 'relu')
        self.d2 = tf.keras.layers.Dense(16, 'relu')
        self.output_layer = tf.keras.layers.Dense(1)
        # the names of the features
        self.numeric_feature_names = ['age', 'thalach', 'trestbps',
                                      'chol', 'oldpeak']
        self.binary_feature_names = ['sex', 'fbs', 'exang']
        self.integer_categorical_feature_names = ['cp', 'restecg',
                                                  'slope', 'ca']
        self.string_categorical_feature_names = ['thal']

    # to normalize the numeric features
    def normalizer(self, feature_names, features):
        df = features.loc[:, feature_names]
        norm_layer = tf.keras.layers.Normalization()
        norm_layer.adapt(df)
        self.norm_layer = norm_layer  # keep a reference without shadowing this method
        return norm_layer(df)

    # to encode the categorical features
    def categoryEncoding(self, feature_names, features, isString=False):
        df = features.loc[:, feature_names]
        if isString:
            self.lookup = tf.keras.layers.StringLookup()
        else:
            self.lookup = tf.keras.layers.IntegerLookup()

        self.lookup.adapt(df)
        self.encoder = tf.keras.layers.CategoryEncoding(
            num_tokens=self.lookup.vocabulary_size())
        return self.encoder(self.lookup(df))

    def preprocessing(self, features):
        binary_features = tf.cast(
            tf.convert_to_tensor(features.loc[:, self.binary_feature_names]),
            tf.float32)
        normalized = self.normalizer(
            self.numeric_feature_names,
            features)
        int_cat_encoded = self.categoryEncoding(
            self.integer_categorical_feature_names,
            features, isString=False)
        str_cat_encoded = self.categoryEncoding(
            self.string_categorical_feature_names,
            features, isString=True)
        preprocessed = [binary_features,
                        normalized,
                        int_cat_encoded,
                        str_cat_encoded]
        return tf.keras.layers.concatenate(preprocessed)

    def call(self, x):
        x = self.d1(x)
        x = self.d2(x)
        x = self.output_layer(x)
        return x

Let's create an instance of the model and train it.

# create an instance of the model
subclass_model = SubclassModel()
# compile the model
subclass_model.compile(optimizer='adam',
                       loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                       metrics=['accuracy'])
# do feature preprocessing
features = subclass_model.preprocessing(X)
# train the model
subclass_model.fit(features, y, epochs=5)
Training logs
GitHub Repository

Where to go next?

Resources
