Hands-on TensorBoard (TensorFlow Dev Summit 2017)

By: Google Developers


Uploaded on 02/15/2017

Join Dandelion Mané in this talk as they demonstrate all the amazing things you can do with TensorBoard. You'll learn how to visualize your TensorFlow graphs, monitor training performance, and explore how your models represent your data. The code examples shown are available here: https://goo.gl/ZwGnPE.

Visit the TensorFlow website for all session recordings: https://goo.gl/bsYmza

Subscribe to the Google Developers channel at http://goo.gl/mQyv5L

Comments (15):

By anonymous    2017-09-20

Sadly, I cannot find more comprehensive documentation. Below I collect all related resources:

PS: Thanks for upvoting me. Now I can post all the links.

Original Thread

By anonymous    2017-09-20

While @rmeerten's answer is correct, you could also consider using TensorBoard, which can be a useful tool for debugging your models and seeing what's happening. For background, you can also check out the TensorBoard session from the TensorFlow Dev Summit.

Original Thread

By anonymous    2017-09-20

There are two ways to profile models. One way is TensorBoard. Here is a comprehensive tutorial about it and here is a good video.

Additionally, clicking on a node will display the exact total memory, compute time, and tensor output sizes.


Another way is the TensorFlow Debugger, which also has tutorials.

Original Thread

By anonymous    2017-10-22

There is an awesome video tutorial (https://www.youtube.com/watch?v=eBbEDRsCmv4) on TensorBoard that covers almost everything about it (graphs, summaries, etc.).

Original Thread

By anonymous    2017-10-22

  1. Variable summaries (scalar, histogram, image, text, etc.) help you track your model through the learning process. For example, tf.summary.scalar('v_loss', validation_loss) adds one point to the loss curve each time you run the summary op, which gives you a rough idea of whether the model has converged and when to stop.
  2. It depends on your variable type. For values like loss, tf.summary.scalar shows the trend across epochs. For variables like the weights in a layer, it is better to use tf.summary.histogram, which shows how the entire distribution of weights changes. I typically use tf.summary.image and tf.summary.text to check the images and texts my model generates over different epochs.
  3. The graph shows your model structure and the size of the tensors flowing through each op. I found it hard at the beginning to organise ops nicely in the graph view, and I learnt a lot about variable scopes from that. The other answer provides a link to a great tutorial for beginners.
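
The difference between a scalar summary and a histogram summary can be mimicked without TensorFlow. The sketch below is a NumPy stand-in, not the TensorBoard API: the loss curve and the weights are made up, and the names scalar_log and histogram_log are hypothetical. It only illustrates that a scalar summary records one number per step while a histogram summary records a whole distribution per step.

```python
import numpy as np

rng = np.random.default_rng(0)

scalar_log = []     # one (step, value) point per step, like a scalar summary
histogram_log = []  # one (step, bin_counts, bin_edges) entry per step

for step in range(3):
    # Made-up training signals: the loss shrinks, the weights spread out.
    loss = 1.0 / (step + 1)
    weights = rng.normal(loc=0.0, scale=0.1 * (step + 1), size=1000)

    # A scalar summary keeps a single number for this step.
    scalar_log.append((step, loss))

    # A histogram summary keeps the distribution of all values for this step.
    counts, edges = np.histogram(weights, bins=10)
    histogram_log.append((step, counts, edges))

for step, loss in scalar_log:
    print("step", step, "loss", loss)
for step, counts, edges in histogram_log:
    print("step", step, "weight range", edges[0], "to", edges[-1])
```

Plotting scalar_log gives a single curve, whereas histogram_log holds one distribution per step, which is why the histogram view is the better fit for weights.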

Original Thread

By anonymous    2017-11-13

I am new to TensorFlow and TFLearn, and while following some tutorials I found the Projector tool https://www.youtube.com/watch?v=eBbEDRsCmv4&t=629s. I was trying to use it with TFLearn, but I couldn't find any example on the internet, and the documentation on the TensorFlow page is not very intuitive https://www.tensorflow.org/programmers_guide/embedding. Can somebody help me with a proper example that integrates TFLearn and the Projector?

Original Thread

By anonymous    2017-11-20

Straight to the point. I'm using SkipGram (see Word2Vec Tutorial) to obtain word embeddings for sequences of words. I've used Hands-on Tensorboard as a starting point. I'd like to run the model for different hyperparameters and compare the resulting weight matrices using t-SNE (even if this is ill-advised). I understand that there are several ways to output the weight matrix and get around this problem, but I'd like to use tf.train.Saver() as described below.

  • Problem: I save each run in a separate folder, namely Tensorboard_data/folder1, Tensorboard_data/folder2, etc. Each folder contains the output of a tf.summary.FileWriter() and a checkpoint written by tf.train.Saver() (after training is completed). Afterwards I run tensorboard --logdir /Tensorboard_data. As stated in Hands-on Tensorboard, I successfully obtain a comparative plot of, say, 4 runs in the histogram, scalar, weight and graph sections. Once I open the drop-down menu labelled "Inactive" (the error might be here: why is it inactive?) and select Projector, I once again have 4 runs. However, it seems I have messed up my checkpoint files somehow: every run shows the same amount of variance explained in PCA, whereas running tensorboard --logdir /Tensorboard_data/folder1 on a single folder gives a different result. The shared value does, however, correspond to that of the last run, say folder4.

I'm at a loss as to how TensorFlow/TensorBoard interprets the checkpoint files written by tf.train.Saver() and how it is able to overwrite the previous runs despite the files being in different folders. This might be a bug; however, since I'm not sure about this, I didn't want to bother the TensorFlow people over at GitHub.

Original Thread

By anonymous    2018-03-26

So I am running a CNN for a classification problem. I have 3 conv layers with 3 pooling layers. P3 is the output of the last pooling layer, whose dimensions are [Batch_size, 4, 12, 48], and I want to flatten that tensor into a [Batch_size, 2304] matrix, where 2304 = 4*12*48. I had been working with "Option A" (see below) for a while, but one day I wanted to try out "Option B", which would theoretically give me the same result. However, it did not. I had checked the following thread beforehand:

Is tf.contrib.layers.flatten(x) the same as tf.reshape(x, [n, 1])?

but that just added more confusion, since trying "Option C" (taken from the aforementioned thread) gave yet another different result.

P3 = tf.nn.max_pool(A3, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')
P3_shape = P3.get_shape().as_list()

P = tf.contrib.layers.flatten(P3)                              # Option A
P = tf.reshape(P3, [-1, P3_shape[1]*P3_shape[2]*P3_shape[3]])  # Option B
P = tf.reshape(P3, [tf.shape(P3)[0], -1])                      # Option C

I am more inclined to go with "Option B", since that is the one I have seen in a video by Dandelion Mané (https://www.youtube.com/watch?v=eBbEDRsCmv4&t=631s), but I would like to understand why these 3 options give different results.
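
For a tensor whose shape is fully known, Options B and C describe the same row-major reshape, so they would be expected to agree element for element. The sketch below checks this with NumPy, using np.reshape as a stand-in for tf.reshape (an assumption of this example; both use row-major ordering). The array p3 and the batch size of 5 are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for P3 with shape [batch, 4, 12, 48].
batch_size = 5
p3 = np.arange(batch_size * 4 * 12 * 48, dtype=np.float32)
p3 = p3.reshape(batch_size, 4, 12, 48)

# Option B: infer the batch dimension, spell out the feature dimension.
h, w, c = p3.shape[1:]
flat_b = p3.reshape(-1, h * w * c)

# Option C: spell out the batch dimension, infer the feature dimension.
flat_c = p3.reshape(p3.shape[0], -1)

print(flat_b.shape)
print(np.array_equal(flat_b, flat_c))
```

If the options still give different numbers in a real training run, one thing worth checking is whether the weights were initialized with a different random seed between runs, since the reshape itself should not change any values. That is a guess, not something the post confirms.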

Thanks for any help!

Original Thread

By anonymous    2018-03-26

In the Hands-on TensorBoard video by Dandelion Mané he writes the following code when talking about collecting some summaries and writing them to disk:

#(... some code and some summaries...)
merged_summary = tf.summary.merge_all()
writer = tf.summary.FileWriter("/tmp/mnist_demo/3")

for i in range(2001):
  batch = mnist.train.next_batch(100)
  if i % 5 == 0:
    s = sess.run(merged_summary, feed_dict={x:batch[0], y: batch[1]})
    writer.add_summary(s, i)

So I took inspiration from there for my own code; a snippet is shown below:

costs = []   # To keep track of the cost per epoch
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Z5, labels=Y))
tf.summary.scalar('cost', cost)

for epoch in range(num_epochs):

        minibatches_cost = 0
        seed = seed + 1
        minibatches_train = random_mini_batches(X_train, Y_train, minibatch_size, seed)
        num_minibatches_train = len(minibatches_train)

        for minibatch in minibatches_train:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Run the session to execute the optimizer and the cost; the feed_dict should contain a minibatch for (X, Y).
            _ , minibatch_cost = sess.run([optimizer, cost], feed_dict={X:minibatch_X, Y:minibatch_Y})

            minibatches_cost += minibatch_cost    # Adding the cost per minibatch

        epoch_cost = minibatches_cost / num_minibatches_train  # Cost per epoch

        if print_cost == True and epoch % 5 == 0:      # Print the cost
            print ("Cost after epoch %i: %f" % (epoch, epoch_cost))
            print ("Time elapsed: %i" % t_elapsed)

        if epoch % 1 == 0:                             # Append the cost
            costs.append(epoch_cost)

        if epoch % 1 == 0:                             # Write summaries
            summary_str = merged_summary.eval(feed_dict={X:minibatch_X, Y:minibatch_Y})
            file_writer.add_summary(summary_str, epoch)

My question is whether I am feeding the correct data to the session when evaluating merged_summary. The way I am doing it now, the cost written to disk in the summary is the cost of a single minibatch (in fact the last minibatch generated by random_mini_batches). By contrast, the cost per epoch (epoch_cost in the code), which I save in the costs variable to plot and study its evolution, is the average cost over all minibatches in the epoch (a more accurate measure than the cost of one minibatch, I assume).

I guess feeding the whole training set is not the solution, but I am a bit confused about why only one batch of the training data is fed when evaluating the summaries.
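
The gap between the two numbers can be illustrated in plain Python. This is a minimal sketch with made-up per-minibatch costs and sizes; the helper epoch_mean_cost is hypothetical and not part of the post's code. It compares the last-minibatch cost (what the summary op sees), the unweighted mean used for epoch_cost in the snippet, and a size-weighted mean, which matters when the last minibatch is smaller than the others.

```python
def epoch_mean_cost(minibatch_costs, minibatch_sizes):
    """Size-weighted mean of per-minibatch mean costs (hypothetical helper)."""
    total = sum(c * n for c, n in zip(minibatch_costs, minibatch_sizes))
    return total / sum(minibatch_sizes)


# Made-up mean costs for one epoch; the last minibatch is smaller.
costs = [0.9, 0.7, 0.8, 0.2]
sizes = [64, 64, 64, 8]

last_minibatch_cost = costs[-1]                # what gets logged when only the last minibatch is fed
unweighted_mean = sum(costs) / len(costs)      # how epoch_cost is computed in the snippet
weighted_mean = epoch_mean_cost(costs, sizes)  # mean over all examples in the epoch

print(last_minibatch_cost, unweighted_mean, weighted_mean)
```

In TensorFlow 1.x, one option for logging an already computed Python number such as epoch_cost is to build the summary by hand with tf.Summary(value=[tf.Summary.Value(tag="epoch_cost", simple_value=epoch_cost)]) and pass it to add_summary, rather than re-evaluating the cost op on a single minibatch. The tag name here is only an example.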

Thanks for any help

Original Thread

By anonymous    2018-04-02

I am working on a project that aims to detect objects in certain difficult circumstances. I ran a test with Mask_RCNN on a dataset that contains that specific type of difficult example, and it did a pretty good job on some of them.

Surprisingly, though, some other examples were not detected, for no obvious reason. To understand the reason behind this performance difference, I was advised to use TensorBoard. But then I realized that it's mostly used for the training phase, as I understood from this video.

At the end of the video, however, they mention an integration project for TensorBoard, namely the TensorFlow Debugger integration. Unfortunately, I could not find any further information about what became of that feature.

Is there any way to visualize weights and activation maps inside a CNN during the inference/evaluation phase?

Original Thread

By anonymous    2018-05-09

In my thesis I am describing a Deep Reinforcement Learning (DRL) example which is written in Python. I did not write the code; I just got it running and training on a Linux server, and it all works fine.

Now I am at the point where I want to visualize the accuracy/prediction, loss, learning stability and so on with TensorBoard. I am working in a conda virtual environment where I have installed gym, atari-py, Pillow and PyOpenGL. On the server TensorFlow-GPU is installed. This is the link to the repository where I got the code from. I have watched tutorials about TensorBoard and I understand the idea, but I cannot work out how to include summary ops such as tf.summary.histogram() in my code. I do not know where exactly to put them or which variables might be important to visualize.

I have already got TensorBoard to visualize the whole network as a graph via the file writer, which looks like this.

But now I am stuck. Every time I try to include histograms of variables, the code throws errors. It seems that I put them in the wrong places, or that I am trying to visualize the wrong variables. (I am not very familiar with Python, so maybe I get the syntax wrong.) It would be super nice if someone could help me.

This is the link to the TensorBoard tutorial I followed.

Below you see the code. The example consists only of this file.

from __future__ import division, print_function, unicode_literals

# Handle arguments (before slow imports so --help can be fast)
import argparse

parser = argparse.ArgumentParser(
    description="Train a DQN net to play MsMacman.")
parser.add_argument("-n", "--number-steps", type=int, default=4000000,
                    help="total number of training steps")
parser.add_argument("-l", "--learn-iterations", type=int, default=4,
                    help="number of game iterations between each training step")
parser.add_argument("-s", "--save-steps", type=int, default=1000,
                    help="number of training steps between saving checkpoints")
parser.add_argument("-c", "--copy-steps", type=int, default=10000,
                    help="number of training steps between copies of online DQN to target DQN")
parser.add_argument("-r", "--render", action="store_true", default=False,
                    help="render the game during training or testing")
parser.add_argument("-p", "--path", default="my_dqn.ckpt",
                    help="path of the checkpoint file")
parser.add_argument("-t", "--test", action="store_true", default=False,
                    help="test (no learning and minimal epsilon)")
parser.add_argument("-v", "--verbosity", action="count", default=0,
                    help="increase output verbosity")
args = parser.parse_args()

from collections import deque
import gym
import numpy as np
import os
import tensorflow as tf

writer = tf.summary.FileWriter("/home/maggie/tbfiles/1")

env = gym.make("MsPacman-v0")
done = True  # env needs to be reset

# First let's build the two DQNs (online & target)
input_height = 88
input_width = 80
input_channels = 1
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8, 8), (4, 4), (3, 3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"] * 3
conv_activation = [tf.nn.relu] * 3

#tf.summary.histogram("activation", conv_activation)
n_hidden_in = 64 * 11 * 10  # conv3 has 64 maps of 11x10 each
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n  # 9 discrete actions are available
initializer = tf.contrib.layers.variance_scaling_initializer()

def q_network(X_state, name):
    prev_layer = X_state
    with tf.variable_scope(name) as scope:
        for n_maps, kernel_size, strides, padding, activation in zip(
                conv_n_maps, conv_kernel_sizes, conv_strides,
                conv_paddings, conv_activation):
            prev_layer = tf.layers.conv2d(
                prev_layer, filters=n_maps, kernel_size=kernel_size,
                strides=strides, padding=padding, activation=activation,
                kernel_initializer=initializer)
        last_conv_layer_flat = tf.reshape(prev_layer, shape=[-1, n_hidden_in])
        hidden = tf.layers.dense(last_conv_layer_flat, n_hidden,
                                 activation=hidden_activation,
                                 kernel_initializer=initializer)
        outputs = tf.layers.dense(hidden, n_outputs,
                                  kernel_initializer=initializer)
    trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope=scope.name)
    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}
    return outputs, trainable_vars_by_name

X_state = tf.placeholder(tf.float32, shape=[None, input_height, input_width,
                                            input_channels], name="x_state")
online_q_values, online_vars = q_network(X_state, name="q_networks/online")
target_q_values, target_vars = q_network(X_state, name="q_networks/target")

# We need an operation to copy the online DQN to the target DQN
copy_ops = [target_var.assign(online_vars[var_name])
            for var_name, target_var in target_vars.items()]
copy_online_to_target = tf.group(*copy_ops)

# Now for the training operations
learning_rate = 0.001
momentum = 0.95

with tf.variable_scope("train"):
    X_action = tf.placeholder(tf.int32, shape=[None], name="x_action")
    y = tf.placeholder(tf.float32, shape=[None, 1], name="labels")
    q_value = tf.reduce_sum(online_q_values * tf.one_hot(X_action, n_outputs),
                            axis=1, keep_dims=True)
    error = tf.abs(y - q_value)
    clipped_error = tf.clip_by_value(error, 0.0, 1.0)
    linear_error = 2 * (error - clipped_error)
    loss = tf.reduce_mean(tf.square(clipped_error) + linear_error, name="loss")

    global_step = tf.Variable(0, trainable=False, name='global_step')
    optimizer = tf.train.MomentumOptimizer(
        learning_rate, momentum, use_nesterov=True, name="xent")
    training_op = optimizer.minimize(loss, global_step=global_step)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

# Let's implement a simple replay memory
replay_memory_size = 20000
replay_memory = deque([], maxlen=replay_memory_size)

def sample_memories(batch_size):
    indices = np.random.permutation(len(replay_memory))[:batch_size]
    cols = [[], [], [], [], []]  # state, action, reward, next_state, continue
    for idx in indices:
        memory = replay_memory[idx]
        for col, value in zip(cols, memory):
            col.append(value)
    cols = [np.array(col) for col in cols]
    return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],
            cols[4].reshape(-1, 1))

# And on to the epsilon-greedy policy with decaying epsilon
eps_min = 0.1
eps_max = 1.0 if not args.test else eps_min
eps_decay_steps = args.number_steps // 2

def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
    if np.random.rand() < epsilon:
        return np.random.randint(n_outputs)  # random action
    else:
        return np.argmax(q_values)  # optimal action

# We need to preprocess the images to speed up training
mspacman_color = np.array([210, 164, 74]).mean()

def preprocess_observation(obs):
    img = obs[1:176:2, ::2]  # crop and downsize
    img = img.mean(axis=2)  # to greyscale
    img[img == mspacman_color] = 0  # Improve contrast
    img = (img - 128) / 128 - 1  # normalize from -1. to 1.
    return img.reshape(88, 80, 1)

# TensorFlow - Execution phase
training_start = 10000  # start training after 10,000 game iterations
discount_rate = 0.99
skip_start = 90  # Skip the start of every game (it's just waiting time).
batch_size = 50
iteration = 0  # game iterations
done = True  # env needs to be reset

# We will keep track of the max Q-Value over time and compute the mean per game
loss_val = np.infty
game_length = 0
total_max_q = 0
mean_max_q = 0.0

with tf.Session() as sess:
    if os.path.isfile(args.path + ".index"):
        saver.restore(sess, args.path)
    else:
        init.run()
        copy_online_to_target.run()
    while True:
        step = global_step.eval()
        if step >= args.number_steps:
            break
        iteration += 1
        if args.verbosity > 0:
            print("\rIteration {}   Training step {}/{} ({:.1f})%   "
                  "Loss {:5f}    Mean Max-Q {:5f}   ".format(
                iteration, step, args.number_steps, step * 100 / args.number_steps,
                loss_val, mean_max_q), end="")
        if done:  # game over, start again
            obs = env.reset()
            for skip in range(skip_start):  # skip the start of each game
                obs, reward, done, info = env.step(0)
            state = preprocess_observation(obs)

        if args.render:
            env.render()

        # Online DQN evaluates what to do
        q_values = online_q_values.eval(feed_dict={X_state: [state]})
        action = epsilon_greedy(q_values, step)

        # Online DQN plays
        obs, reward, done, info = env.step(action)
        next_state = preprocess_observation(obs)

        # Let's memorize what happened
        replay_memory.append((state, action, reward, next_state, 1.0 - done))
        state = next_state

        if args.test:
            continue

        # Compute statistics for tracking progress (not shown in the book)
        total_max_q += q_values.max()
        game_length += 1
        if done:
            mean_max_q = total_max_q / game_length
            total_max_q = 0.0
            game_length = 0

        if iteration < training_start or iteration % args.learn_iterations != 0:
            continue  # only train after warmup period and at regular intervals

        # Sample memories and use the target DQN to produce the target Q-Value
        X_state_val, X_action_val, rewards, X_next_state_val, continues = (
            sample_memories(batch_size))
        next_q_values = target_q_values.eval(
            feed_dict={X_state: X_next_state_val})
        max_next_q_values = np.max(next_q_values, axis=1, keepdims=True)
        y_val = rewards + continues * discount_rate * max_next_q_values

        # Train the online DQN
        _, loss_val = sess.run([training_op, loss], feed_dict={
            X_state: X_state_val, X_action: X_action_val, y: y_val})

        # Regularly copy the online DQN to the target DQN
        if step % args.copy_steps == 0:
            copy_online_to_target.run()

        # And save regularly
        if step % args.save_steps == 0:
            saver.save(sess, os.path.join(os.getcwd(), 'my_dqn.ckpt'))

        merged_summary = tf.summary.merge_all()

This is the error I get in the terminal when I uncomment the line tf.summary.histogram("activation", conv_activation) (line 48 of the script):

(pacman) maggie@neuronalresearch:~/Documents/AI/my_project_folder/my_project$ python tiny_dqn.py -v --number-steps 1000
Traceback (most recent call last):
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in make_tensor_proto
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 468, in <listcomp>
    str_values = [compat.as_bytes(x) for x in proto_values]
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
TypeError: Expected binary or unicode string, got <function relu at 0x7f8999cca048>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tiny_dqn.py", line 48, in <module>
    tf.summary.histogram("activation", conv_activation)
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/summary/summary.py", line 192, in histogram
    tag=tag, values=values, name=scope)
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 188, in _histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 513, in _apply_op_helper
    raise err
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 510, in _apply_op_helper
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 926, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
    value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/opt/anaconda3/envs/pacman/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 472, in make_tensor_proto
    "supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'list'> to Tensor. Contents: [<function relu at 0x7f8999cca048>, <function relu at 0x7f8999cca048>, <function relu at 0x7f8999cca048>]. Consider casting elements to a supported type.
(pacman) maggie@neuronalresearch:~/Documents/AI/my_project_folder/my_project$ 

Original Thread

By anonymous    2018-05-14

So I was watching the Google Developers video on YouTube, Hands-on TensorBoard (TensorFlow Dev Summit 2017), and I am having a lot of trouble recreating the graph shown at 6:01.

The following is my code:

import tensorflow as tf

#1. add some name for w and b
#2. apply name scope

def conv_layer(input, channels_in, channels_out, name = "conv"):
    with tf.name_scope(name): 
        w = tf.Variable(tf.zeros([5, 5, channels_in, channels_out]), name = "W")
        b = tf.Variable(tf.zeros([channels_out]), name = "B")
        conv = tf.nn.conv2d(input, w, strides=[1, 1, 1, 1], padding="SAME")
        act = tf.nn.relu(conv + b)
        return act

#1. add some name for w and b
#2. apply name scope

def fc_layer(input, channels_in, channels_out, name = "fc"):
    with tf.name_scope(name):
        w = tf.Variable(tf.zeros([channels_in, channels_out]), name = "W")
        b = tf.Variable(tf.zeros([channels_out]), name = "B")
        act = tf.nn.relu(tf.matmul(input,w) + b)
        return act

#1. add some name for placeholders, conv layer, fc, logits
#2. apply name scope

# Setup placeholders, and reshape the data
x = tf.placeholder(tf.float32, shape=[None, 784], name = "x")
y = tf.placeholder(tf.float32, shape=[None, 10], name = "labels")
x_image = tf.reshape(x, [-1, 28, 28, 1])

conv1 = conv_layer(x_image, 1, 32, "conv1")
pool1 = tf.nn.max_pool(conv1, ksize=[1,2,2,1], strides = [1,2,2,1], padding = "SAME")

conv2 = conv_layer(pool1, 32, 64, "conv2")
pool2 = tf.nn.max_pool(conv2, ksize=[1,2,2,1], strides = [1,2,2,1], padding = "SAME")
flattened = tf.reshape(pool2, [-1, 7*7*64])

fcl = fc_layer(flattened, 7*7*64, 1024, "fcl")
logits = fc_layer(fcl, 1024, 10, "fc2")

# added name scope and changed the name for cross_entropy

with tf.name_scope("xent"):
    xent = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels = y))

#cross_entropy = tf.reduce_mean(
#    tf.nn.softmax_cross_entropy_with_logits(logits = logits, labels = y))

with tf.name_scope("train"):
    train_step = tf.train.AdamOptimizer(1e-4).minimize(xent)

with tf.name_scope("accuracy"):
    correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(y,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

sess = tf.Session()

writer = tf.summary.FileWriter("/Users/jianxiongji/graphs/change3/")

My graph looks like this, but the one in the presentation looks like this.

I am quite confused; maybe I missed something or am just plain wrong, but no error was shown when I ran the code above.

Thank you all in advance for helping me. I would really appreciate it if you could point me to some good tutorials or material on TensorBoard.

Original Thread
