Voynich Manuscript: word vectors and t-SNE visualization of some patterns

Update 17/01: Reddit discussion thread.

Update 19/01: Hacker News thread.

The Codex

voynich_header

The Voynich Manuscript is a handwritten codex written in an unknown system and carbon-dated to the early 15th century (1404–1438). Although the manuscript has been studied by some famous cryptographers of World War I and II, nobody has deciphered it yet. The manuscript is known to be written in two different languages (Language A and Language B), and it is also known to have been written by a group of people. The manuscript itself has always been the subject of a lot of different hypotheses, including the one I like the most, the "culture extinction" hypothesis, supported in 2014 by Stephen Bax. This hypothesis states that the codex isn't ciphered; it states that the codex was just written in an unknown language that disappeared due to a culture extinction. In 2014, Stephen Bax proposed a provisional, partial decoding of the manuscript; the video of his presentation is very interesting and I really recommend you to watch it if you like this codex. There is also a transcription of the manuscript, thanks to the hard work that many people put into transcribing it a long time ago.

Word Vectors

My idea, when I heard about Stephen Bax's work, was to try to capture the patterns of the text using word2vec. Word embeddings are created by using a shallow neural network architecture. It is an unsupervised technique that uses supervised learning tasks to learn the linguistic context of the words. Here is a visualization of this architecture from the TensorFlow site:

SOFTMAX-nplm
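To make the "supervised learning tasks" part concrete, here is a minimal sketch of how training pairs are generated in the skip-gram variant: each word is paired with its neighbors inside a small context window. The helper function and window size are illustrative, not the TensorFlow implementation.

def skipgram_pairs(sentence, window=2):
    """Pair each word with the words inside its context window."""
    pairs = []
    for i, center in enumerate(sentence):
        start, end = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(skipgram_pairs(['fachys', 'ykal', 'ar', 'ataiin'], window=1))
# [('fachys', 'ykal'), ('ykal', 'fachys'), ('ykal', 'ar'),
#  ('ar', 'ykal'), ('ar', 'ataiin'), ('ataiin', 'ar')]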

These word vectors, once trained, carry a lot of semantic meaning. For instance:

word2vecqueen
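As a hedged illustration of this analogy, the same query can be run with gensim and a small pretrained English model (the model name below is just one of the models available from gensim's downloader, not the one used elsewhere in this post):

import gensim.downloader as api

# Illustrative only: load a small pretrained English model.
model = api.load('glove-wiki-gigaword-100')

# vector('king') - vector('man') + vector('woman') lands closest to 'queen'.
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# [('queen', ...)]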

As we can see, these vectors can be used in vector operations to extract information about the regularities of the language, capturing its semantics. These vectors also bring words with similar meanings close together, allowing similarity queries like the examples below:

>>> model.most_similar("man")
[(u'woman', 0.6056041121482849),
 (u'guy', 0.4935004413127899),
 (u'boy', 0.48933547735214233),
 (u'men', 0.4632953703403473),
 (u'person', 0.45742249488830566),
 (u'lady', 0.4487500488758087),
 (u'himself', 0.4288588762283325),
 (u'girl', 0.4166809320449829),
 (u'his', 0.3853422999382019),
 (u'he', 0.38293731212615967)]

>>> model.most_similar("queen")
[(u'princess', 0.519856333732605),
 (u'latifah', 0.47644317150115967),
 (u'prince', 0.45914226770401),
 (u'king', 0.4466976821422577),
 (u'elizabeth', 0.4134873151779175),
 (u'antoinette', 0.41033703088760376),
 (u'marie', 0.4061327874660492),
 (u'stepmother', 0.4040161967277527),
 (u'belle', 0.38827288150787354),
 (u'lovely', 0.38668593764305115)]

Word vectors can also be used (surprise!) for translation, and this is, I think, the most important feature of word vectors when used to understand a text for which we know the translation of only a few words. My intention is to try to use the words discovered by Stephen Bax in the future to see whether it is possible to capture some transformation that could lead to the discovery of structures similar to those of other languages. A nice visualization of this feature is the one below, from the paper "Exploiting Similarities among Languages for Machine Translation":

translation

This visualization was made using gradient descent to optimize a linear transformation between the source and destination language word vectors. As you can see, the structure in Spanish is really close to the structure in English.
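As a minimal sketch of this idea, assuming we had word vectors for a handful of known translation pairs, the linear transformation can be fitted in closed form (the least-squares solution that gradient descent would converge to). The data below is a random placeholder, not real aligned vectors:

import numpy as np

# Placeholder data: rows are word vectors for known translation pairs
# (e.g. a few decoded Voynichese words aligned with English words).
rng = np.random.default_rng(0)
src_vecs = rng.normal(size=(50, 100))   # source-language vectors
tgt_vecs = rng.normal(size=(50, 100))   # target-language vectors

# Fit W minimizing ||src_vecs @ W - tgt_vecs||^2 (Mikolov et al., 2013).
W, _, _, _ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)

def nearest_word(vec, matrix, words):
    """Return the word whose vector is most cosine-similar to 'vec'."""
    sims = (matrix @ vec) / (np.linalg.norm(matrix, axis=1)
                             * np.linalg.norm(vec) + 1e-9)
    return words[int(np.argmax(sims))]

# To "translate" an unseen source word x: nearest_word(x @ W, tgt_vecs, tgt_words)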

EVA Transcription

In order to train the model, I had to parse and extract the transcription from the EVA (European Voynich Alphabet) to be able to feed the Voynich sentences into the word2vec model. This EVA transcription has the following format:

<f1r.P1.1;H> fachys.ykal.ar.ataiin.shol.shory.cth res.y.kor.sholdy!-
fachys.ykal.ar.ataiin.shol!.shory.cthorys.y.kor.sholdy-
!!!FYA ys.ykal.ar.ytaiin.shol.shory * K * res.y kor.sholdy-
fachys.ykal.ar.ataiin.shol.shory.cth res.y,kor.sholdy-
FYA ys.ykal.ar.ytaiin.shol.shory.***!r*s.y.kor.sholdo*- #
sory.ckhar.o!r.y.kair.chtaiin.shar.are.cthar.cthar.dan!-
sory.ckhar.o.r.y.kain.shtaiin.shar.ar*.cthar.cthar.dan!-
sory.ckhar.o!r!y.kair.chtaiin.shor.ar!.cthar.cthar.dana-
sory.ckhar.o!r,y.kair.chtaiin.shar.are.cthar.cthar,dan!-
sory.ckhar.o!r!y.kair.chtaiin.shor.ary.cthar.cthar.dan*-

The first data, between "<" and ">", holds information about the folio (page), the line, and the author of the transcription. The transcription block above corresponds to the first two lines of the first folio of the manuscript, shown below:

Part of the "f1r" folio

As you can see, the EVA contains some special characters, like "!" and "*" for instance, and they all have a meaning, such as indicating that the author of the transcription wasn't sure about the character in that position, and so on. The EVA also contains transcriptions of the same line of the same folio by different authors.

To convert this transcription to sentences, I used only the lines where the authors were sure about the entire line, and I took the first line satisfying this condition. I also did some cleaning of the transcription to remove the names of the drawings from the text, like: "text.text.text-{plant}text" -> "text text texttext".
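This is not the exact parser used for the post, but a minimal sketch of such a cleaning step; the regular expressions and the uncertainty filter are illustrative assumptions:

import re

def eva_to_words(line):
    """Turn one EVA transcription line into a list of words.

    Sketch of the cleaning above: strip the '<...>' locus tag, drop
    '{...}' drawing names, skip lines with uncertainty marks ('!', '*'),
    and split the remaining text on the '.' word separator.
    """
    line = re.sub(r'^<[^>]*>\s*', '', line)    # remove '<f1r.P1.1;H>' tag
    line = re.sub(r'\{[^}]*\}', '', line)      # remove drawing names
    line = line.strip().rstrip('-')
    if '!' in line or '*' in line:             # transcriber was unsure
        return None
    return [w for w in re.split(r'[.,]', line) if w]

print(eva_to_words('<f1r.P1.1;H> sory.ckhar.or.y.kair.chtaiin.dan-'))
# ['sory', 'ckhar', 'or', 'y', 'kair', 'chtaiin', 'dan']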

After this conversion from the EVA transcription to sentences compatible with the word2vec model, I trained the model to provide 100-dimensional word vectors for the words of the manuscript.
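A minimal sketch of this training step with gensim follows; apart from the 100 dimensions, the hyperparameters and the tiny placeholder corpus are illustrative (in gensim versions before 4.0 the vector_size argument was called size):

from gensim.models import Word2Vec

# Placeholder corpus: in practice, 'sentences' is the full list of token
# lists produced from the EVA transcription (see the sketch above).
sentences = [
    ['fachys', 'ykal', 'ar', 'ataiin', 'shol', 'shory'],
    ['sory', 'ckhar', 'or', 'y', 'kair', 'chtaiin', 'dan'],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # 100-dimensional vectors, as in the post
    window=5,         # illustrative context window
    min_count=1,      # keep rare words; the real setting is a guess
)
print(model.wv.most_similar('fachys', topn=3))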

Visualizing the vector space with t-SNE

After training the word vectors, I created a visualization of the 100-dimensional vectors in a 2D embedding space using the t-SNE algorithm:

tsne-vis1
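A minimal sketch of how such a projection can be produced with scikit-learn and matplotlib, assuming a gensim model trained on the full manuscript as in the previous step (the t-SNE hyperparameters here may differ from the ones used for the plot):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Collect the 100-D vectors for every word in the trained model.
words = list(model.wv.index_to_key)   # gensim >= 4.0 attribute
vectors = model.wv[words]

# Project down to 2D; perplexity must be below the number of words.
coords = TSNE(n_components=2, perplexity=min(30.0, len(words) - 1),
              random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=3)
plt.title('t-SNE projection of the Voynich word vectors')
plt.show()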

As you can see, there are a lot of small clusters and, visually, two big clusters, probably accounting for the two different languages used in the Codex (I still need to confirm this two-languages aspect). After clustering it with DBSCAN (using the original word vectors, not the t-SNE transformed vectors), we can clearly see the two major clusters:

tsne-VIS-DBSCAN
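A sketch of this clustering step, reusing the vectors and t-SNE coordinates from the snippet above (eps and min_samples are illustrative guesses, not the values used for the plot):

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Cluster the original 100-D vectors, not the 2D t-SNE projection.
labels = DBSCAN(eps=0.5, min_samples=5, metric='cosine').fit_predict(vectors)

# Reuse the t-SNE coordinates, colouring each point by its cluster.
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=3, cmap='tab10')
plt.show()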

Now comes the really interesting and useful part of the word vectors: if we use a star name from the folio below (it is pretty obvious why it is thought that this is probably a star name):

>>> w2v_model.most_similar("octhey")
[('qoekaiin', 0.6402825713157654),
 ('otcheody', 0.6389687061309814),
 ('ytchos', 0.566596269607544),
 ('ocphy', 0.5415685176849365),
 ('dolchedy', 0.5343093872070312),
 ('aiicthy', 0.5323750376701355),
 ('odchecthy', 0.5235849022865295),
 ('okeeos', 0.5187858939170837),
 ('cphocthy', 0.5159749388694763),
 ('oteor', 0.5050544738769531)]

And I got really interesting similar words; "ocphy", for instance, and other close star names:

stars

It also returned the word "qoekaiin" from folio 48, which precedes the same star name:

foliostars

As you can see, word vectors are really useful for finding linguistic structures. We can also create another plot, showing how close the star names are in the 2D embedding space visualization created using t-SNE:

star_clus
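A sketch of how such star names can be highlighted on top of the t-SNE projection, reusing words and coords from the t-SNE snippet above (the list of star names is just an illustrative subset taken from the queries):

import matplotlib.pyplot as plt

# A few suspected star names from the queries above (illustrative subset).
star_words = ['octhey', 'qoekaiin', 'otcheody', 'ocphy']
idx = [words.index(w) for w in star_words if w in words]

plt.scatter(coords[:, 0], coords[:, 1], s=3, alpha=0.3)
plt.scatter(coords[idx, 0], coords[idx, 1], c='red', s=25)
for i in idx:
    plt.annotate(words[i], coords[i])
plt.show()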

As you can see, zooming into the big cluster of stars, we can see that they are really close together in the vector space. These representations could be used, for instance, to infer plant names from the herbal section, etc.

My idea here was to show how useful word vectors can be for analyzing the unknown text of the Codex. I hope you liked it, and I hope this may somehow be useful for other people who are also interested in this amazing manuscript.

- Christian S. Perone

Cite this article as: Christian S. Perone, "Voynich Manuscript: word vectors and t-SNE visualization of some patterns," in Terra Incognita, 16/01/2016, //www.cpetem.com/2016/01/voynich-manuscript-word-vectors-and-t-sne-visualization-of-some-patterns/.

References

Voynich Digitalization

Stephen Bax Site

René Zandbergen's site