
### Markov's Inequality

$$\underbrace{P(X \geq \alpha)}_{\text{probability of being greater than the constant } \alpha} \leq \underbrace{\frac{\mathbb{E}\left[X\right]}{\alpha}}_{\text{bounded above by the expectation over the constant } \alpha}$$

Example: a grocery store sells on average 40 beers per day (it's summer!). What is the probability that it will sell 80 or more beers tomorrow?

$$\begin{align} P(X \geq \alpha) & \leq \frac{\mathbb{E}\left[X\right]}{\alpha} \\\\ P(X \geq 80) & \leq \frac{40}{80} = 0.5 = 50\% \end{align}$$
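
As a quick sanity check, the bound can be verified by simulation. Here I assume (hypothetically) that daily sales follow an exponential distribution with mean 40; Markov's inequality only requires the variable to be non-negative, so any such distribution would do:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: daily beer sales as an exponential variable with
# mean 40 (Markov's inequality only requires X to be non-negative).
sales = rng.exponential(scale=40.0, size=100_000)

empirical = np.mean(sales >= 80)   # simulated P(X >= 80)
markov_bound = sales.mean() / 80   # E[X] / alpha

print(empirical, markov_bound)     # the bound holds, but is quite loose
assert empirical <= markov_bound
```

For this particular distribution the true tail probability is around $$e^{-2} \approx 0.135$$, well below the 0.5 that Markov's inequality guarantees, which illustrates how loose the bound can be.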

### Chebyshev's Inequality

When we have information about the underlying distribution of a random variable, we can take advantage of properties of this distribution to know more about the concentration of this variable. Let’s take for example a normal distribution with mean $$\mu = 0$$ and unit standard deviation $$\sigma = 1$$ given by the probability density function (PDF) below:

$$f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$

$$P( \mid X - \mu \mid \geq k\sigma) \leq \frac{1}{k^2}$$

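
A quick simulation with the standard normal distribution above shows the bound holding (and being quite conservative for this distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # mu = 0, sigma = 1

# Chebyshev's bound 1/k^2 holds for any distribution with finite variance;
# for the normal distribution it is very conservative.
for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x) >= k)
    print(k, empirical, 1.0 / k**2)
    assert empirical <= 1.0 / k**2
```

At $$k = 2$$, for example, Chebyshev guarantees at most 25% of the mass beyond two standard deviations, while the normal distribution actually places only about 4.6% there.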

### Chebyshev's Inequality and the Weak Law of Large Numbers

Chebyshev's inequality can also be used to prove the weak law of large numbers, which says that the sample mean converges in probability to the true mean.

That can be done as follows:

• Consider a sequence of i.i.d. (independent and identically distributed) random variables $$X_1, X_2, X_3, \ldots$$ with mean $$\mu$$ and variance $$\sigma^2$$;
• The sample mean is $$M_n = \frac{X_1 + \ldots + X_n}{n}$$ and the true mean is $$\mu$$;
• For the expectation of the sample mean we have: $$\mathbb{E}\left[M_n\right] = \frac{\mathbb{E}\left[X_1\right] + \ldots + \mathbb{E}\left[X_n\right]}{n} = \frac{n\mu}{n} = \mu$$
• For the variance of the sample mean we have: $$Var\left[M_n\right] = \frac{Var\left[X_1\right] + \ldots + Var\left[X_n\right]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$
• By the application of Chebyshev's inequality we have: $$P(\mid M_n - \mu \mid \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}$$ for any (fixed) $$\epsilon > 0$$. As $$n$$ increases, the right side of the inequality goes to zero. Intuitively, this means that for a large $$n$$ the distribution of $$M_n$$ will be concentrated around $$\mu$$.
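
The steps above can be sketched with a simulation. Here I assume (arbitrarily) normally distributed variables with $$\mu = 40$$ and $$\sigma = 10$$, and estimate the deviation probability for increasing $$n$$:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, eps = 40.0, 10.0, 1.0

# As n grows, the probability that the sample mean deviates from mu by
# more than eps is bounded by sigma^2 / (n * eps^2), which goes to zero.
for n in (10, 100, 1000):
    samples = rng.normal(mu, sigma, size=(20_000, n))
    m_n = samples.mean(axis=1)                   # 20000 sample means of size n
    empirical = np.mean(np.abs(m_n - mu) >= eps)
    bound = sigma**2 / (n * eps**2)
    print(n, empirical, bound)
    assert empirical <= bound
```

By $$n = 1000$$ the bound is already 0.1, and the simulated deviation probability is far smaller still.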

### Improving on Markov’s and Chebyshev’s with Chernoff Bounds

$$P(A \cap B) = P(A)P(B) \\ P(A \cap C) = P(A)P(C) \\ P(B \cap C) = P(B)P(C)$$

$$P(A \cap B \cap C) = P(A)P(B)P(C)$$
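
The distinction the two sets of equations above draw is between pairwise and mutual independence: the three pairwise products can all hold while the joint one fails. A classic counterexample can be checked exhaustively with two fair coins, where $$C$$ is the event that the two coins agree:

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips; every outcome has probability 1/4.
omega = set(product([0, 1], repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 1}          # first coin is heads
B = {w for w in omega if w[1] == 1}          # second coin is heads
C = {w for w in omega if w[0] == w[1]}       # the two coins agree

# Pairwise independent...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ...but not mutually independent: P(A n B n C) = 1/4, not 1/8.
assert P(A & B & C) != P(A) * P(B) * P(C)
```

This is why Chernoff-style arguments, which multiply expectations across variables, require full (mutual) independence rather than just pairwise independence.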

## Privacy-preserving sentence semantic similarity using InferSent embeddings and secure two-party computation

### Privacy-preserving computation

$r = f(A, B)$

It seems very counterintuitive that a problem like that could ever be solved, but to the surprise of many people, it is possible to solve it under some security requirements. Thanks to recent developments in techniques such as FHE (Fully Homomorphic Encryption), Oblivious Transfer, and Garbled Circuits, such problems are starting to become practical for real-life use, and they are nowadays being employed by many companies in applications such as information exchange, secure location, advertising, satellite orbit collision avoidance, etc.

I'm not going to enter into details of these techniques, but if you're interested in the intuition behind OT (Oblivious Transfer), you should definitely read the amazing explanation done by Craig Gidney here. There are also, of course, many different protocols for doing 2PC or MPC, where each one of them assumes some security requirements (semi-honest, malicious, etc.). I'm not going to enter into the details, to keep the post focused on the goal, but you should be aware of that.

### Sentence similarity comparison

Now, how can we exchange information about Bob's and Alice's project sentences without disclosing information about the project descriptions?

One naive way to do that would be to just compute the hashes of the sentences and then compare only the hashes to check if they match. However, this would assume that the descriptions are exactly the same, and besides that, if the entropy of the sentences is small (like small sentences), someone with reasonable computation power can try to recover the sentence.
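
A minimal sketch of that naive hashing approach (using SHA-256 from Python's standard library) makes the limitation obvious: the two sentences below are semantically close, but their digests are unrelated:

```python
import hashlib

def sentence_digest(sentence):
    # Hash a (trivially normalized) sentence; only exact matches collide.
    return hashlib.sha256(sentence.strip().lower().encode('utf-8')).hexdigest()

a = sentence_digest('my cat loves to walk over my keyboard')
b = sentence_digest('the cat is always walking over my keyboard')

print(a == b)  # equality of hashes only captures exact equality
```

What we want instead is a comparison that degrades gracefully with semantic distance, which is where sentence embeddings come in.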

### Generating sentence embeddings with InferSent

InferSent is an NLP technique for universal sentence representation developed by Facebook that uses supervised training to produce highly transferable representations.

Note: even if you don’t have GPU, you can have reasonable performance doing embeddings for a few sentences.

```python
import numpy as np
import torch

# Trained model from: https://github.com/facebookresearch/InferSent
GLOVE_EMBS = '../dataset/GloVe/glove.840B.300d.txt'
INFERSENT_MODEL = 'infersent.allnli.pickle'

# Load trained InferSent model
model = torch.load(INFERSENT_MODEL,
                   map_location=lambda storage, loc: storage)
model.set_glove_path(GLOVE_EMBS)
model.build_vocab_k_words(K=100000)
```

$Cos(\pmb x, \pmb y) = \frac {\pmb x \cdot \pmb y}{||\pmb x|| \cdot ||\pmb y||}$

$cos(\hat{x}, \hat{y}) = \hat{x} \cdot \hat{y}$
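
Since the embeddings are normalized to unit norm, the cosine similarity reduces to a plain dot product, which is exactly what the secure inner product below will compute. A quick numpy check of that equivalence:

```python
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 0.5])

# Normalize to unit vectors, as encode() will do below:
x_hat = x / np.linalg.norm(x)
y_hat = y / np.linalg.norm(y)

# cos(x, y) == x_hat . y_hat
assert np.isclose(cosine(x, y), np.dot(x_hat, y_hat))
```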

```python
# This function will forward the text into the model and
# get the embeddings. After that, it will normalize it
# to a unit vector.
def encode(model, text):
    embedding = model.encode([text])[0]
    embedding /= np.linalg.norm(embedding)
    return embedding
```

```python
# This function will scale the embedding in order to
# remove the radix point.
def scale(embedding):
    SCALE = 1 << 14
    scale_embedding = np.clip(embedding, 0.0, 1.0) * SCALE
    return scale_embedding.astype(np.int32)
```

```python
# The list of Alice sentences
alice_sentences = [
    'my cat loves to walk over my keyboard',
    'I like to pet my cat',
]

# The list of Bob sentences
bob_sentences = [
    'the cat is always walking over my keyboard',
]
```

```python
# Alice sentences
alice_sentence1 = encode(model, alice_sentences[0])
alice_sentence2 = encode(model, alice_sentences[1])

# Bob sentences
bob_sentence1 = encode(model, bob_sentences[0])
```

```python
>>> np.dot(bob_sentence1, alice_sentence1)
0.8798542

>>> np.dot(bob_sentence1, alice_sentence2)
0.62976325
```

```python
# Scale the Alice sentence embeddings
alice_sentence1_scaled = scale(alice_sentence1)
alice_sentence2_scaled = scale(alice_sentence2)

# Scale the Bob sentence embeddings
bob_sentence1_scaled = scale(bob_sentence1)

# This is the unit vector embedding for the sentence
>>> alice_sentence1
array([ 0.01698913, -0.0014404 ,  0.0010993 , ...,  0.00252409,
        0.00828147,  0.00466533], dtype=float32)

# This is the scaled vector as integers
>>> alice_sentence1_scaled
array([278,   0,  18, ...,  41, 135,  76], dtype=int32)
```

### Two-party secure computation

ABY is very easy to use because you can describe your inputs, shares, and gates, and it will do the rest for you, such as creating the socket communication channel and exchanging data when needed. However, the implementation is entirely written in C++, and I'm not aware of any Python bindings for it (a great contribution opportunity).

Fortunately, there is an implemented example for ABY that can do dot product calculation for us; the example is here. I won't replicate the example here, but the only part we have to change is to read the embedding vectors that we created before instead of generating random vectors, and to increase the bit length to 32 bits.

```
# This will execute the server part; the -r 0 specifies the role (server)
# and the -n 4096 defines the dimension of the vector (InferSent generates
# 4096-dimensional embeddings).
~# ./innerproduct -r 0 -n 4096

# And the same on another process (or another machine; however, for another
# machine execution you'll obviously have to specify the IP).
~# ./innerproduct -r 1 -n 4096
```

```
Inner product of alice_sentence1 and bob_sentence1 = 226691917
Inner product of alice_sentence2 and bob_sentence1 = 171746521
```

```python
>>> SCALE = 1 << 14

# This is the dot product we should get
>>> np.dot(alice_sentence1, bob_sentence1)
0.8798542

# This is the inner product we got in the secure computation
>>> 226691917 / SCALE**2.0
0.8444931

# This is the dot product we should get
>>> np.dot(alice_sentence2, bob_sentence1)
0.6297632

# This is the inner product we got in the secure computation
>>> 171746521 / SCALE**2.0
0.6398056
```

– Christian S. Perone

## New prime on the block

GIMPS (the Great Internet Mersenne Prime Search) confirmed yesterday the new largest known prime: $2^{77232917} - 1$. This new largest known prime has 23,249,425 digits and is, of course, a Mersenne prime, a prime expressed in the form $2^n - 1$, whose primality can be computed efficiently using the Lucas-Lehmer primality test.
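
As a side note, the Lucas-Lehmer recurrence itself is remarkably short; a minimal sketch in Python (real GIMPS runs use FFT-based multiplication to make each squaring feasible at this size):

```python
def lucas_lehmer(p):
    """Return True if the Mersenne number 2^p - 1 is prime (p an odd prime).

    M_p is prime iff s_{p-2} == 0, where s_0 = 4 and
    s_i = s_{i-1}^2 - 2 (mod M_p).
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

# 2^13 - 1 = 8191 is a Mersenne prime, while 2^11 - 1 = 2047 = 23 * 89 is not.
assert lucas_lehmer(13)
assert not lucas_lehmer(11)
```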

```python
>>> import numpy as np
>>> a = 2
>>> b = 77232917
>>> num_digits = int(1 + b * np.log10(a))
>>> print(num_digits)
23249425
```

$2^1 \equiv 2 \pmod{10}$
$2^2 \equiv 4 \pmod{10}$
$2^3 \equiv 8 \pmod{10}$
$2^4 \equiv 6 \pmod{10}$
$2^5 \equiv 2 \pmod{10}$
$2^6 \equiv 4 \pmod{10}$
(... repeats)

Which means that the powers of 2 mod 10 repeat every 4 numbers; thus we just need to compute 77,232,917 mod 4, which is 1. Given that $2^{77232917} \equiv 2^1 \pmod{10}$, the number $2^{77232917}$ ends in 2, and when you subtract 1 you end up with 1 as the last digit, as you can confirm by looking at the entire number (~10 MB zip file).
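
The cycle argument can also be confirmed directly with Python's three-argument pow(), which performs fast modular exponentiation and handles an exponent of this size instantly:

```python
# Powers of 2 mod 10 cycle with period 4, and 77232917 mod 4 == 1:
assert 77232917 % 4 == 1

# So the last digit of 2^77232917 is 2:
assert pow(2, 77232917, 10) == 2

# And subtracting 1 leaves 1 as the last digit of the prime:
assert (pow(2, 77232917, 10) - 1) % 10 == 1
```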

– Christian S. Perone

## Benford's Law - Index

Since Benford's law got some attention in the past years, I decided to make a list of the previous posts I made on the subject in the context of elections, fraud, corruption, universality and prime numbers:

Despesas de Custeio e a Lei de Benford (June 2014 - in Portuguese)

Prime Numbers and the Benford's Law (May 2009)

Delicious.com, checking user numbers against Benford's Law (April 2009)

– Christian S. Perone

## Deep learning - Convolutional neural networks and feature extraction with Python

Convolutional neural networks (or ConvNets) are biologically-inspired variants of MLPs; they have different kinds of layers, and each layer works differently than the usual MLP layers. If you are interested in learning more about ConvNets, a good course is CS231n – Convolutional Neural Networks for Visual Recognition. The architecture of the CNNs is shown in the image below:

### Loading the MNIST dataset

The MNIST dataset is one of the most traditional datasets for digits classification. We will use a pickled version of it for Python, but first, let's import the packages that we will need:

```python
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from urllib import urlretrieve
import cPickle as pickle
import os
import gzip
import numpy as np
import theano
import lasagne
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet
from nolearn.lasagne import visualize
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
```

```python
def load_dataset():
    url = 'http://deeplearning.net/data/mnist/mnist.pkl.gz'
    filename = 'mnist.pkl.gz'
    if not os.path.exists(filename):
        print("Downloading MNIST dataset...")
        urlretrieve(url, filename)

    with gzip.open(filename, 'rb') as f:
        data = pickle.load(f)

    X_train, y_train = data[0]
    X_val, y_val = data[1]
    X_test, y_test = data[2]

    X_train = X_train.reshape((-1, 1, 28, 28))
    X_val = X_val.reshape((-1, 1, 28, 28))
    X_test = X_test.reshape((-1, 1, 28, 28))

    y_train = y_train.astype(np.uint8)
    y_val = y_val.astype(np.uint8)
    y_test = y_test.astype(np.uint8)

    return X_train, y_train, X_val, y_val, X_test, y_test
```

```python
X_train, y_train, X_val, y_val, X_test, y_test = load_dataset()
plt.imshow(X_train[0][0], cmap=cm.binary)
```

### ConvNet architecture and training

```python
net1 = NeuralNet(
    layers=[('input', layers.InputLayer),
            ('conv2d1', layers.Conv2DLayer),
            ('maxpool1', layers.MaxPool2DLayer),
            ('conv2d2', layers.Conv2DLayer),
            ('maxpool2', layers.MaxPool2DLayer),
            ('dropout1', layers.DropoutLayer),
            ('dense', layers.DenseLayer),
            ('dropout2', layers.DropoutLayer),
            ('output', layers.DenseLayer),
            ],
    # input layer
    input_shape=(None, 1, 28, 28),
    # layer conv2d1
    conv2d1_num_filters=32,
    conv2d1_filter_size=(5, 5),
    conv2d1_nonlinearity=lasagne.nonlinearities.rectify,
    conv2d1_W=lasagne.init.GlorotUniform(),
    # layer maxpool1
    maxpool1_pool_size=(2, 2),
    # layer conv2d2
    conv2d2_num_filters=32,
    conv2d2_filter_size=(5, 5),
    conv2d2_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool2
    maxpool2_pool_size=(2, 2),
    # dropout1
    dropout1_p=0.5,
    # dense
    dense_num_units=256,
    dense_nonlinearity=lasagne.nonlinearities.rectify,
    # dropout2
    dropout2_p=0.5,
    # output
    output_nonlinearity=lasagne.nonlinearities.softmax,
    output_num_units=10,
    # optimization method params
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=10,
    verbose=1,
    )

# Train the network
nn = net1.fit(X_train, y_train)
```

```
# Neural Network with 160362 learnable parameters

## Layer information

#    name      size
---  --------  --------
  0  input     1x28x28
  1  conv2d1   32x24x24
  2  maxpool1  32x12x12
  3  conv2d2   32x8x8
  4  maxpool2  32x4x4
  5  dropout1  32x4x4
  6  dense     256
  7  dropout2  256
  8  output    10

epoch  train loss  valid loss  train/val  valid acc  dur
-----  ----------  ----------  ---------  ---------  ---
    1     0.85204     0.16707    5.09977    0.95174  33.71s
    2     0.27571     0.10732    2.56896    0.96825  33.34s
    3     0.20262     0.08567    2.36524    0.97488  33.51s
    4     0.16551     0.07695    2.15081    0.97705  33.50s
    5     0.14173     0.06803    2.08322    0.98061  34.38s
    6     0.12519     0.06067    2.06352    0.98239  34.02s
    7     0.11077     0.05532    2.00254    0.98427  33.78s
    8     0.10497     0.05771    1.81898    0.98248  34.17s
    9     0.09881     0.05159    1.91509    0.98407  33.80s
   10     0.09264     0.04958    1.86864    0.98526  33.40s
```

### Prediction and confusion matrix

Now we can use the model to predict the entire testing dataset:

```python
preds = net1.predict(X_test)
```

```python
cm = confusion_matrix(y_test, preds)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
```

### Visualization of the filters

We can also visualize the 32 filters from the first convolutional layer:

```python
visualize.plot_conv_weights(net1.layers_['conv2d1'])
```

### Theano layer functions and feature extraction

Now it is time to create theano-compiled functions that will feed-forward the input data into the architecture up to the layer you're interested in. I'm going to get the functions for the output layer and also for the dense layer before the output layer:

```python
dense_layer = layers.get_output(net1.layers_['dense'], deterministic=True)
output_layer = layers.get_output(net1.layers_['output'], deterministic=True)
input_var = net1.layers_['input'].input_var

f_output = theano.function([input_var], output_layer)
f_dense = theano.function([input_var], dense_layer)
```

```python
instance = X_test[0][None, :, :]

%timeit -n 500 f_output(instance)
500 loops, best of 3: 858 µs per loop
```

```python
pred = f_output(instance)
N = pred.shape[1]
plt.bar(range(N), pred.ravel())
```

```python
pred = f_dense(instance)
N = pred.shape[1]
plt.bar(range(N), pred.ravel())
```

I hope you enjoyed the tutorial !

## Google's S2, geometry on the sphere, cells and Hilbert curve

### The cells

• They are compact (represented by 64-bit integers)
• They have resolution for geographical features
• They are hierarchical (they have levels, and similar levels have similar areas)
• Containment queries for arbitrary regions are really fast

### Hilbert Curve

In the image below, the point at the very beginning of the Hilbert curve (the string) is also located at the very beginning along the curve (the curve is represented by a long string at the bottom of the image):

Now in the image below, where we have more points, it is easy to see how the Hilbert curve preserves spatial locality. You can note that points close to each other on the curve (in the 1D representation, the line at the bottom) are also close in 2D space (in the x,y plane). However, note that the opposite isn't quite true: you can have 2D points that are close to each other in the x,y plane but not close on the Hilbert curve.
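
That locality property can be checked with the standard iterative index-to-coordinate mapping (the d2xy algorithm popularized by Wikipedia's Hilbert curve article, sketched here for illustration): consecutive positions along the 1D string always land on adjacent 2D cells:

```python
def d2xy(n, d):
    """Map a distance d along the Hilbert curve to (x, y) on an n x n grid
    (n a power of two). Standard iterative algorithm."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate the quadrant as needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Locality along the string: consecutive indices are always adjacent cells.
pts = [d2xy(8, d) for d in range(64)]
for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
    assert abs(x0 - x1) + abs(y0 - y1) == 1
```

The converse fails, as the text notes: two cells that touch in the plane can sit far apart along the string.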

### Example

* In this tutorial I used the Python 2.7 bindings from the following repository. The instructions to compile and install it are present in the README of the repository, so I won't repeat them here.

```python
>>> import s2
>>> latlng = s2.S2LatLng.FromDegrees(-30.043800, -51.140220)
>>> cell = s2.S2CellId.FromLatLng(latlng)
>>> cell.level()
30
>>> cell.id()
10743750136202470315
>>> cell.ToToken()
951977d377e723ab
```

You can also get the parent cell of that cell (one level above it) and use containment methods to check if a cell is contained by another cell:

```python
>>> parent = cell.parent()
>>> print parent.level()
29
>>> parent.id()
10743750136202470316
>>> parent.ToToken()
951977d377e723ac
>>> cell.contains(parent)
False
>>> parent.contains(cell)
True
```
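
Under the hood, the hierarchy is pure bit arithmetic on the 64-bit id: the position of the lowest set bit encodes the level, and the parent is obtained by widening that bit by two positions. A sketch of that trick (my reading of how S2CellId works, checked against the ids from the session above):

```python
def lsb(cell_id):
    # Lowest set bit of the 64-bit cell id; its position encodes the level.
    return cell_id & -cell_id

def level(cell_id):
    # Level 30 is a leaf cell (lsb at bit 0); each level up shifts it by 2.
    return 30 - (lsb(cell_id).bit_length() - 1) // 2

def parent_id(cell_id):
    # Widen the lsb by two bits and clear everything below it.
    new_lsb = lsb(cell_id) << 2
    return (cell_id & -new_lsb) | new_lsb

cell = 10743750136202470315     # the leaf cell from the session above
assert level(cell) == 30
assert parent_id(cell) == 10743750136202470316
assert level(parent_id(cell)) == 29
```

This is why containment checks and parent lookups are so cheap: no geometry is involved, only integer masking.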

```python
>>> region_rect = S2LatLngRect(
        S2LatLng.FromDegrees(-51.264871, -30.241701),
        S2LatLng.FromDegrees(-51.04618, -30.000003))
>>> coverer = S2RegionCoverer()
>>> coverer.set_min_level(8)
>>> coverer.set_max_level(15)
>>> coverer.set_max_cells(500)
>>> covering = coverer.GetCovering(region_rect)
```

```python
import matplotlib.pyplot as plt
from s2 import *
from shapely.geometry import Polygon
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt

proj = cimgt.MapQuestOSM()
plt.figure(figsize=(20, 20), dpi=200)
ax = plt.axes(projection=proj.crs)
ax.add_image(proj, 12)
ax.set_extent([-51.411886, -50.922470, -30.301314, -29.94364])

region_rect = S2LatLngRect(
    S2LatLng.FromDegrees(-51.264871, -30.241701),
    S2LatLng.FromDegrees(-51.04618, -30.000003))

coverer = S2RegionCoverer()
coverer.set_min_level(8)
coverer.set_max_level(15)
coverer.set_max_cells(500)
covering = coverer.GetCovering(region_rect)

geoms = []
for cellid in covering:
    new_cell = S2Cell(cellid)
    vertices = []
    for i in xrange(0, 4):
        vertex = new_cell.GetVertex(i)
        latlng = S2LatLng(vertex)
        vertices.append((latlng.lat().degrees(),
                         latlng.lng().degrees()))
    geo = Polygon(vertices)
    geoms.append(geo)

print "Total Geometries: {}".format(len(geoms))

ax.add_geometries(geoms, ccrs.PlateCarree(),
                  facecolor='coral', edgecolor='black', alpha=0.4)
plt.show()
```

– Christian S. Perone

## Universality, primes and space communication

– Christian S. Perone