Convolutional hypercolumns in Python

如果你是以下一些机器学习的消息,你肯定看到了瑞恩·达尔所做的工作Automatic ColorizationHacker News commentsreddit的评论)。这个惊人的工作是采用像素hypercolumn从VGG-16网络,以便提取上色图像的信息。Samimalso used the network to process Black & White video frames and produced the amazing video below:




Many algorithms using features from CNNs (Convolutional Neural Networks) usually use the last FC (fully-connected) layer features in order to extract information about certain input. However, the information in the last FC layer may be too coarse spatially to allow precise localization (due to sequences of maxpooling, etc.), on the other side, the first layers may be spatially precise but will lack semantic information. To get the best of both worlds, the authors of thehypercolumn paperdefine the hypercolumn of a pixel as the vector of activations of all CNN units “above” that pixel.

Hypercolumn Extraction
Hypercolumn Extraction通过Hypercolumns为对象分割和细粒度本地化

在hypercolumns的提取的第一步是将图像馈送到CNN(Convolutional Neural Network)并提取特征图激活针对图像的每个位置。最棘手的部分是,当特征地图比输入图像更小,例如一个池操作之后,该论文的作者,然后做特征映射的双线性采样,以保持特征地图上输入相同的大小。也有与FC的问题(fully-connected)layers, because you can’t isolate units semantically tied only to one pixel of the image, so the FC activations are seen as 1×1 feature maps, which means that all locations shares the same information regarding the FC part of the hypercolumn. All these activations are then concatenated to create the hypercolumn. For instance, if we take the VGG-16 architecture to use only the first 2 convolutional layers after the max pooling operations, we will have a hypercolumn with the size of:



128 filterssecond conv layer before pooling)=192 features

This means that each pixel of the image will have a 192-dimension hypercolumn vector. This hypercolumn is really interesting because it will contain information about the first layers (where we have a lot of spatial information but little semantic) and also information about the final layers (with little spatial information and lots of semantics). Thus this hypercolumn will certainly help in a lot of pixel classification tasks such as the one mentioned earlier of automatic colorization, because each location hypercolumn carries the information about what this pixel semantically and spatially represents. This is also very helpful on segmentation tasks (you can see more about that on the original paper introducing the hypercolumn concept).



Before being able to extract the hypercolumns, we’ll setup the VGG-16 pre-trained network, because you know, the price of a good GPU (I can’t even imagine many of them) here in Brazil is very expensive and I don’t want to sell my kidney to buy a GPU.


要设置上Keras预训练VGG-16网络,你需要下载文件的权重from here(具有大约500MB vgg16_weights.h5文件),然后设置的体系结构和加载使用Keras下载的权重(more information about the weights file and architecturehere):

from matplotlib import pyplot as plt import theano import cv2 import numpy as np import scipy as sp from keras.models import Sequential from keras.layers.core import Flatten, Dense, Dropout from keras.layers.convolutional import Convolution2D, MaxPooling2D from keras.layers.convolutional import ZeroPadding2D from keras.optimizers import SGD from sklearn.manifold import TSNE from sklearn import manifold from sklearn import cluster from sklearn.preprocessing import StandardScaler def VGG_16(weights_path=None): model = Sequential() model.add(ZeroPadding2D((1,1),input_shape=(3,224,224))) model.add(Convolution2D(64, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(64, 3, 3, activation='relu')) model.add(MaxPooling2D((2,2), stride=(2,2))) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(128, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(128, 3, 3, activation='relu')) model.add(MaxPooling2D((2,2), stride=(2,2))) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(256, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(256, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(256, 3, 3, activation='relu')) model.add(MaxPooling2D((2,2), stride=(2,2))) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(512, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(512, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(512, 3, 3, activation='relu')) model.add(MaxPooling2D((2,2), stride=(2,2))) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(512, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(512, 3, 3, activation='relu')) model.add(ZeroPadding2D((1,1))) model.add(Convolution2D(512, 3, 3, activation='relu')) model.add(MaxPooling2D((2,2), stride=(2,2))) model.add(Flatten()) model.add(Dense(4096, activation='relu')) model.add(Dropout(0.5)) model.add(Dense(4096, activation='relu')) model.add(Dropout(0.5)) model.add(Dense(1000, activation='softmax')) if weights_path: model.load_weights(weights_path) return model

As you can see, this is a very simple code to declare the VGG16 architecture and load the pre-trained weights (together with Python imports for the required packages). After that we’ll compile the Keras model:

model = VGG_16('vgg16_weights.h5') sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True) model.compile(optimizer=sgd, loss='categorical_crossentropy')

Now let’s test the network using an image:

im_original = cv2.resize(cv2.imread('madruga.jpg'), (224, 224)) im = im_original.transpose((2,0,1)) im = np.expand_dims(im, axis=0) im_converted = cv2.cvtColor(im_original, cv2.COLOR_BGR2RGB) plt.imshow(im_converted)

Image used

Image used

As we can see, we loaded the image, fixed the axes and then we can now feed the image into the VGG-16 to get the predictions:

OUT = model.predict(IM)plt.plot(out.ravel())


As you can see, these are the final activations of the softmax layer, the class with the “jersey, T-shirt, tee shirt” category.

Extracting arbitrary feature maps

Now, to extract the feature map activations, we’ll have to being able to extract feature maps from arbitrary convolutional layers of the network. We can do that by compiling a Theano function using theget_output()Keras的方法,如在下面的例子:

get_feature = theano.function([model.layers[0].input], model.layers[3].get_output(train=False), allow_input_downcast=False) feat = get_feature(im) plt.imshow(feat[0][2])




get_feature = theano.function([model.layers[0].input], model.layers[15].get_output(train=False), allow_input_downcast=False) feat = get_feature(im) plt.imshow(feat[0][13])

More semantic feature maps

More semantic feature maps.


Extracting hypercolumns

Now finally let’s extract the hypercolumns of arbitrary set of layers. To do that, we will define a function to extract these hypercolumns:

DEF extract_hypercolumn(型号,layer_indexes,实例):层= [model.layers [LI] .get_output(列车=假),用于在layer_indexes LI] get_feature = theano.function([model.layers [0]。输入],层allow_input_downcast =假)feature_maps = get_feature(实例)hypercolumns = []用于feature_maps convmap:用于convmap FMAP [0]:=放大的sp.misc.imresize(FMAP,大小=(224,224),模式= “F”,口译= '双线性')hypercolumns.append(放大的)返回np.asarray(hypercolumns)

我们可以看到,这个函数将预计三个标准ameters: the model itself, an list of layer indexes that will be used to extract the hypercolumn features and an image instance that will be used to extract the hypercolumns. Let’s now test the hypercolumn extraction for the first 2 convolutional layers:

layers_extract = [3, 8] hc = extract_hypercolumn(model, layers_extract, im)

就这样,我们提取的每个像素的hypercolumn载体。这个“HC”可变的形状是:(192L,224L,224L),这意味着我们为224×224像素中的每一个(总共50176个像素与每个192 hypercolumn功能)192维hypercolumn。

Let’s plot the average of the hypercolumns activations for each pixel:

ave = np.average(hc.transpose(1, 2, 0), axis=2) plt.imshow(ave)
Hypercolumn average for layers 3 and 8.
Hypercolumn average for layers 3 and 8.

Ad you can see, those first hypercolumn activations are all looking like edge detectors, let’s see how these hypercolumns looks like for the layers 22 and 29:

layers_extract = [22,29] HC = extract_hypercolumn(模型,layers_extract,1M)平均= np.average(hc.transpose(1,2,0),轴= 2)plt.imshow(AVE)
Hypercolumn average for the layers 22 and 29.
Hypercolumn average for the layers 22 and 29.

As we can see now, the features are really more abstract and semantically interesting but with spatial information a little fuzzy.

Remember that you can extract the hypercolumns using all the initial layers and also the final layers, including the FC layers. Here I’m extracting them separately to show how they differ in the visualization plots.

Simple hypercolumn pixel clustering

Now, you can do a lot of things, you can use these hypercolumns to classify pixels for some task, to do automatic pixel colorization, segmentation, etc. What I’m going to do here just as an experiment, is to use the hypercolumns (from the VGG-16 layers 3, 8, 15, 22, 29) and then cluster it using KMeans with 2 clusters:

m = hc.transpose(1 2 0)。重塑(50176 1)kmeans= cluster.KMeans(n_clusters=2, max_iter=300, n_jobs=5, precompute_distances=True) cluster_labels = kmeans .fit_predict(m) imcluster = np.zeros((224,224)) imcluster = imcluster.reshape((224*224,)) imcluster = cluster_labels plt.imshow(imcluster.reshape(224, 224), cmap="hot")
KMeans clustering using hypercolumns.
KMeans clustering using hypercolumns.


I hope you liked it !

– Christian S. Perone

Cite this article as: Christian S. Perone, "Convolutional hypercolumns in Python," in亚洲金博宝未知领域,11/01/2016,//

Real time Drone object tracking using Python and OpenCV

After flying this past weekend (together withGabrieland莱昂德罗)with Gabriel’s drone (which is an handmade APM 2.6 based quadcopter) in our town (Porto Alegre, Brasil), I decided to implement a tracking for objects using OpenCV and Python and check how the results would be using simple and fast methods like均值漂移。The result was very impressive and I believe that there is plenty of room for optimization, but the algorithm is now able to run in real time using Python with good results and with a Full HD resolution of 1920×1080 and 30 fps.

Here is the video of the flight that was piloted byGabriel



  • A ROI (Region of Interest) is defined, in this case the building that I want to track
  • The normalized histogram and back-projection are calculated
  • The Meanshift algorithm is used to track the ROI

The entire code for the tracking is described below:

import numpy as np import cv2 def run_main(): cap = cv2.VideoCapture('upabove.mp4') # Read the first frame of the video ret, frame = # Set the ROI (Region of Interest). Actually, this is a # rectangle of the building that we're tracking c,r,w,h = 900,650,70,70 track_window = (c,r,w,h) # Create mask and normalized histogram roi = frame[r:r+h, c:c+w] hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV) mask = cv2.inRange(hsv_roi, np.array((0., 30.,32.)), np.array((180.,255.,255.))) roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180]) cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX) term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 80, 1) while True: ret, frame = hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV) dst = cv2.calcBackProject([hsv], [0], roi_hist, [0,180], 1) ret, track_window = cv2.meanShift(dst, track_window, term_crit) x,y,w,h = track_window cv2.rectangle(frame, (x,y), (x+w,y+h), 255, 2) cv2.putText(frame, 'Tracked', (x-25,y-10), cv2.FONT_HERSHEY_SIMPLEX, 1, (255,255,255), 2, cv2.CV_AA) cv2.imshow('Tracking', frame) if cv2.waitKey(1) & 0xFF == ord('q'): break cap.release() cv2.destroyAllWindows() if __name__ == "__main__": run_main()

I hope you liked it !