Posts tagged with: machine learning

YourTTS – Swiss Knife for Text-to-Speech

👉 Try out YourTTS demo 

👉 Visit YourTTS project page 

👉 Try YourTTS on Colab

👉 Try voice conversion with YourTTS on Colab

YourTTS

The recent surge of new end-to-end deep learning models has enabled new and exciting Text-to-Speech (TTS) use-cases with impressive, natural-sounding results. However, most of these models are trained on massive datasets (20-40 hours) recorded with a single speaker in a professional environment. In this setting, expanding a solution to multiple languages and speakers is not feasible for everyone. Moreover, it is particularly tough for low-resource languages not commonly targeted by mainstream research. To overcome these limitations and bring zero-shot TTS to low-resource languages, we built YourTTS, which can synthesize voices in multiple languages and reduce data requirements significantly by transferring knowledge among the languages in the training set. For instance, we can easily introduce Brazilian Portuguese to the model with a single-speaker dataset by co-training with a larger English dataset. This lets the model speak Brazilian Portuguese with voices from the English dataset, and we can even introduce new speakers by zero-shot learning on the fly.

In “YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone”, we introduce YourTTS, which targets:

  • Multi-Lingual TTS. Synthesizing speech in multiple languages with a single model.
  • Multi-Speaker TTS. Synthesizing speech with different voices with a single model.
  • Zero-Shot learning. Adapting the model to synthesize the speech of a novel speaker without re-training the model.
  • Speaker/language adaptation. Fine-tuning a pre-trained model to learn a new speaker or language. (Learn Turkish from a relatively smaller dataset by transferring knowledge from learned languages) 
  • Cross-language voice transfer. Transferring a voice from its original language to a different language. (Using the voice of an English speaker in French)
  • Zero-shot voice conversion. Changing the voice of a given speech clip.

Model Architecture

YourTTS is an extension of our previous work SC-GlowTTS. It uses the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model as its backbone architecture and builds on top of it. We use a larger text encoder than the original model. YourTTS also employs a separately trained speaker encoder to compute speaker embedding vectors (d-vectors) that pass speaker information to the rest of the model. We use the H/ASP model as the speaker encoder architecture. See the figure below for the overall model architecture in training (right) and inference (left).

VITS is a peculiar TTS model, as it combines different deep-learning techniques (adversarial learning, normalizing flows, variational auto-encoders, transformers) to achieve high-quality, natural-sounding output. It is mainly built on the GlowTTS model. GlowTTS is light, robust to long sentences, converges rapidly, and is backed by theory since it directly maximizes the log-likelihood of speech with the alignment. However, its biggest weakness is the lack of naturalness and expressivity in the output.

VITS improves on it with specific updates. First, it replaces the duration predictor with a stochastic duration predictor that better models the variability in speech. Second, it connects a HiFi-GAN vocoder to the decoder’s output and joins the two with a variational autoencoder (VAE). This allows the model to train end-to-end and to find a better intermediate representation than the traditionally used mel-spectrograms. The result is higher fidelity and more precise prosody, achieving better MOS values, as reported in the paper.

Note that both GlowTTS and VITS implementations are available on 🐸TTS.

Dataset

We combined multiple datasets for different languages. We used VCTK and LibriTTS for English (multispeaker datasets), TTS-Portuguese Corpus (TPC) for Brazilian Portuguese, and the French subset of the M-AILABS dataset (FMAI).

We resample the audio clips to 16 kHz, apply voice activity detection to remove silences, and apply RMS volume normalization before passing them to the speaker encoder. 
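To make the preprocessing concrete, here is a minimal sketch of such a pipeline. It is not the exact code we used: librosa's energy-based trimming stands in for a proper VAD model, and the target RMS level is illustrative.

```python
import numpy as np
import librosa

def preprocess_clip(path, target_sr=16000, top_db=30, target_rms=0.1):
    # load and resample to 16 kHz
    wav, _ = librosa.load(path, sr=target_sr)
    # crude silence removal; a real VAD model would be used instead
    wav, _ = librosa.effects.trim(wav, top_db=top_db)
    # RMS volume normalization
    rms = np.sqrt(np.mean(wav ** 2)) + 1e-8
    return wav * (target_rms / rms)
```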

Training 

We train YourTTS incrementally, starting from a single-speaker English dataset and adding more speakers and languages along the way. We start from a model pre-trained on the LJSpeech dataset for 1M steps and continue training with the VCTK dataset for 200K steps. Next, we randomly initialize the new layers that YourTTS introduces on top of the VITS model. Then we add the other datasets one by one and train for ~120K steps with each new dataset.

Before we report results on each dataset, we also fine-tune the final model with the speaker consistency loss (SCL) on that particular dataset. SCL compares the speaker embedding of the output speech with the ground-truth embedding computed by the speaker encoder, using a cosine similarity loss.
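As a rough illustration, SCL can be sketched as a cosine-similarity penalty between speaker embeddings. This assumes the embeddings are already computed by the speaker encoder, and the weighting factor is illustrative rather than the value used in the paper.

```python
import torch
import torch.nn.functional as F

def speaker_consistency_loss(pred_emb, target_emb, weight=1.0):
    # pred_emb: speaker-encoder embedding of the generated audio
    # target_emb: speaker-encoder embedding of the ground-truth audio
    cos = F.cosine_similarity(pred_emb, target_emb, dim=-1)
    # penalize low similarity; `weight` is an illustrative scaling factor
    return weight * (1.0 - cos).mean()
```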

We used a single V100 GPU with a batch size of 64. We used the AdamW optimizer with beta values of 0.8 and 0.99, a learning rate of 0.0002 decaying exponentially with gamma 0.999875 per iteration, and a weight decay of 0.01.
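In PyTorch, that optimizer setup looks roughly like the following sketch; `model` is only a placeholder for the assembled YourTTS network.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the actual YourTTS model

optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.8, 0.99), weight_decay=0.01
)
# exponential learning rate decay applied per iteration
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999875)
```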

Results 

We ran “mean opinion score” (MOS) and similarity MOS tests to evaluate model performance. We also used the speaker encoder cosine similarity (SECS) to measure the similarity between the predicted outputs and the actual audio clips of a target speaker, computed with a 3rd-party library to stay comparable with previous work. We omit the experimental details for brevity; please refer to the paper for them.

Table (1) above shows our results on the different datasets. Exp1 is trained with only VCTK. Exp2 uses VCTK and TPC. We then add FMAI and LibriTTS for Exp3 and Exp4, respectively. The ground-truth row reports the values for the real speaker clips in the respective datasets. Finally, we compare our results with Attentron ZS and SC-GlowTTS. Note that SC-GlowTTS is our previous work that led the way to YourTTS (you can find its implementation under 🐸TTS). We achieve significantly better results than the compared work in our experiments. MOS values are on par with or even better than the ground truth in some cases, which was surprising even for us.

Table (2) depicts the zero-shot voice conversion (ZSVC) results between languages and genders using the speaker embeddings. For ZSVC, we pass the given speech clip through the posterior encoder to compute the hidden representation and re-run the model in inference mode conditioned on the target speaker’s embedding. The table shows the model’s performance across languages and genders. For instance, “en-pt” shows the results for converting the voice of a Portuguese speaker by conditioning on an English speaker, and “M-F” shows the conversion of a male speaker’s voice to a female speaker’s.

Table (3) gives the results for the speaker adaptation experiments, where we fine-tune the final YourTTS model with SCL on clips of different lengths from a particular novel speaker. For instance, the top row shows the results for a model trained on a male English speaker with 61 seconds of audio. GT is the ground truth, ZS is zero-shot with only the speaker embeddings, and FT is fine-tuning. These results show that our model can achieve high similarity when fine-tuned with only 20 seconds of audio from a speaker, in cases where the mere use of speaker embeddings is not enough to produce high-quality results.

Due to time and space constraints in the paper, we could not expand the experiments to all possible use-cases of YourTTS. We plan to cover those in a future study and add new capabilities to YourTTS that give more control over the model.

Try out YourTTS 

Visit our demo page accompanying this blog post and give YourTTS a try right in your browser.

YourTTS is also available in 🐸TTS with a training recipe and a pre-trained model. You can train your own model, synthesize speech with the pre-trained model, or fine-tune it with your own dataset.
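For example, zero-shot synthesis with the pre-trained model might look like the snippet below. Treat it as a sketch: the model name and Python API shown here follow a recent 🐸TTS release and may differ between versions, so check the 🐸TTS docs (or `tts --list_models`) for the exact identifiers.

```python
from TTS.api import TTS

# model name as listed in recent 🐸TTS releases; verify with `tts --list_models`
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="This is a zero-shot voice cloning example.",
    speaker_wav="reference_speaker.wav",  # a short clip of the target speaker
    language="en",
    file_path="output.wav",
)
```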

Ethical Concerns

We are well aware that the expansion of TTS technology enables various kinds of malicious uses. Therefore, we also actively study different approaches to prevent, or at the very least put more fences in the way of, the misuse of TTS technology.

To exemplify this, on our demo page we add background music to deter unintended use of the generated voice clips on other platforms.

If you also want to contribute to our research & discussion in this field, join us here.

Conclusion

YourTTS achieves competitive results on multi-lingual, multi-speaker TTS and zero-shot learning. It also allows cross-language voice transfer and learning new speakers and languages from considerably smaller datasets than traditional TTS models require.

We are excited to present YourTTS and to see all the different use-cases the 🐸 Community will come up with. As always, feel free to reach out with any feedback.


Notes on GPT-3

Original paper : https://arxiv.org/abs/2005.14165

  • It uses the same architecture as GPT-2.
  • The largest model has 175B parameters and was trained with a batch size of 3.2 million tokens. (Wow!)
  • Training cost exceeds $12M.
  • “Taking all these runs into account, the researchers estimated that building this model generated over 78,000 pounds of CO2 emissions in total—more than the average American adult will produce in two years.”[link]
  • They used a system with more than 285K CPU cores, 10K GPUs, and 400 Gbps of network connectivity per machine. (Too much pollution.)
  • The model is trained on the whole Wikipedia, 2 different Book datasets, and Common Crawl.
  • It learns different tasks from a task description, example(s), and a prompt (see the toy prompt after this list).
    • The task description is a definition of the target action, like “Translate from English to French…”
    • Example(s) are one or more samples used in the one-shot or few-shot learning settings.
    • The prompt is the input on which the target action is performed.
  • The larger the model, the better the results.
  • They perform zero-shot, one-shot, and few-shot learning with the pre-trained language model for specific tasks.
  • At the Question Answering task, it outperforms SOTA models trained with the source documents.
  • At the Translation task, it performs close to SOTA. It is better at translating other languages into English than the reverse, given that it is trained on a predominantly English corpus.
  • Winograd task is determining which word a pronoun refers to in a sentence.
  • Physical Q&A asks questions that require grounded knowledge about the physical world. It outperforms SOTA in all the learning settings.
  • Reading Comprehension is answering questions about a given document. It performs poorly relative to SOTA.
  • Causal Reasoning gives a sentence and asks for the most plausible outcome.
  • Natural Language Inference is the task of determining whether the 2nd sentence matches or contradicts the 1st sentence. It performs well here.
  • At arithmetic operations, small models perform poorly and large models perform well, especially at addition. They discuss that multiplication is a harder operation.
  • At SAT analogy questions, it performs better than an average student.
  • Human accuracy in detecting articles written by the largest model is close to random guessing.
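To make the task-description / examples / prompt format concrete, here is a toy few-shot prompt; the strings are illustrative rather than copied from the paper.

```python
task_description = "Translate English to French:"
examples = [
    "sea otter => loutre de mer",
    "cheese => fromage",
]
prompt = "mint =>"

# the model receives the pieces concatenated as plain text and simply continues it
full_prompt = "\n".join([task_description, *examples, prompt])
print(full_prompt)
```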

I believe GPT-3 is not capable of “reasoning”, contrary to the common belief. GPT-3 rather constitutes an efficient storage mechanism for the data it is trained on. At inference time, the model determines the output by finding the samples most relevant to the given task and interpolating between them.

This is most apparent in the arithmetic tasks. Addition is much easier since it is easier to memorize the whole table of results from the training data. As it gets harder with multiplication, the model struggles to fetch the relevant information and performance drops.

You can also observe this when you take sentences from the generated articles in the paper and google them. Although they do not exactly match any article on the web, you see very similar content and sometimes sentences that differ by only a couple of words.



Why mere Machine Learning cannot predict Bitcoin price

Lately, I have been studying time series to see something beyond the limits of my experience. I decided to use what I learned for cryptocurrency price prediction, with a hunch of getting rich. Kidding? Or not :). As I saw more of the intricacies of the problem, I got in deeper and found a new challenge in it. Now I am in the process of creating something new, combining traditional machine learning with the latest reinforcement learning achievements.

So, the story aside, I would like to see whether an AI bot that trades without manual help is possible or just a luring dream. Lately, I have read a lot about the topic, from traditional financial technical analysis to the latest ML solutions. What I see on the ML front is that many people claim to use lazy ML with success and sell deceitful dreams. What I call lazy ML is: download the data, train the model, done, we are rich!! What I actually found is that they reach false conclusions induced by false interpretations. The bad side of this is that many other people (aka beginner me) try to replicate their results and waste a lot of time. Here, I would like to show a particular mistake in those works, with accompanying code that helps us see the problem more clearly.

Briefly, this work illustrates a simple supervised setting where a model predicts the next Bitcoin move given the current state. Here is the full Notebook, and for a more advanced set of experiments check out the repo. Hope you like it.

Continue Reading


Online Hard Example Mining on PyTorch

Online Hard Example Mining (OHEM) is a way to pick hard examples with reduced computational cost in order to improve your network's performance on borderline cases, which in turn generalizes to overall performance. It is mostly used for object detection. Suppose you want to train a car detector and you have positive images (with cars) and negative images (without cars). In practice, you end up with many more negatives than positives. It is therefore clever to pick the subset of negatives that are the most informative for your network. Hard example mining is the way to do this.

In a detection problem, hard examples correspond to false positive detections, depicted here in red.

In general, to pick a subset of negatives, you first train your network for a couple of iterations, then run it over all your negative instances and pick the ones with the largest loss values. However, this is computationally very costly since you may have millions of images to process, and it is sub-optimal for your optimization since you freeze the network while picking hard instances that are then only used for the next couple of iterations. That is, you assume that all the hard negatives you pick remain useful until the next selection, which is an imperfect assumption, especially for large datasets.

Okay, so what does “online” mean in this regard? OHEM solves the two aforementioned problems by performing hard example selection batch-wise. Given a batch of size K, it performs a regular forward pass and computes per-instance losses. Then it finds the M<K hard examples in the batch with the highest loss values and back-propagates only the loss computed over the selected instances. Smart, huh? 🙂
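A minimal PyTorch sketch of the idea, assuming a plain classification loss; the names and the K/M split are illustrative.

```python
import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, targets, keep_num):
    # per-instance losses, no reduction
    losses = F.cross_entropy(logits, targets, reduction="none")
    # keep the M hardest examples out of the batch of K
    keep_num = min(keep_num, losses.numel())
    hard_losses, _ = torch.topk(losses, keep_num)
    # only the selected instances contribute to the gradient
    return hard_losses.mean()

# usage inside the training loop (illustrative):
# loss = ohem_cross_entropy(model(x), y, keep_num=batch_size // 4)
```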

It reduces computation by running hand in hand with your regular optimization cycle. It also removes the assumption of foreseen usefulness by picking hard examples per iteration, so we now really pick the hard examples for each iteration.

If you would like to test it yourself, here is a PyTorch OHEM implementation that I offer with a grain of salt.


Paper Review: Self-Normalizing Neural Networks

One of the main problems of neural networks is taming layer activations so that one can obtain stable gradients and learn faster without any confining factor. Batch Normalization shows us that keeping values at mean 0 and variance 1 seems to work. However, despite the indisputable effectiveness of BN, it adds extra layers and computations to your model that, in the best case, you would rather not have.

ELU (Exponential Linear Unit) is an activation function that aims to tame neural networks on the fly through a slight modification of the activation function. It keeps positive values as they are and exponentially squashes negative values:

    \[\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}\]

where \(\alpha\) is a constant you define.

 

ELU does its job well enough if you want to avoid the cost of Batch Normalization; however, its effectiveness does not rest on a theoretical proof beyond empirical evidence, and finding a good \(\alpha\) is just a matter of guessing.

Self-Normalizing Neural Networks take things to the next level. In short, the paper describes a new activation function, SELU (Scaled Exponential Linear Unit), a new initialization scheme, and, as a consequence, a new dropout variant.

The main goal here is to keep network activations in a certain basin defined by a mean and a variance value. These can be any values of your choice, but the paper targets mean 0 and variance 1 (similar to the notion behind Batch Normalization). The question then is how to modify the ELU function with scaling factors so that it keeps the activations at that mean and variance on the fly. They find these scaling values through a long theoretical justification, stating that the scaling factors of ELU should be defined such that any value passing through it is contracted toward the defined mean and variance. (This is just a verbal description and by no means complete; please refer to the paper for the theory.)

Above, the scaling factors are shown as \(\alpha\) and \(\lambda\). After a long run of computations, these values turn out to be 1.6732632423543772848170429916717 and 1.0507009873554804934193349852946, respectively. Nevertheless, do not forget that these scaling factors specifically target mean 0 and variance 1; changing the targets means these values change as well.

(Code: https://gist.github.com/erogol/17e6475aa280a79c1cc7653845b58d96)
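For reference, here is a minimal NumPy sketch of SELU with those constants (my own sketch, not the gist linked above).

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # lambda * x for positives, lambda * alpha * (exp(x) - 1) for negatives
    x = np.asarray(x, dtype=np.float64)
    neg = ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return LAMBDA * np.where(x > 0, x, neg)
```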

Initialization is another important part of the whole method. The aim is to start with the right values. They suggest sampling weights from a Gaussian distribution with mean 0 and variance \(1/n\), where \(n\) is the number of incoming weights (the fan-in).
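As a small sketch, the suggested initialization for a fully-connected layer could look like this, with `fan_in` being the number of incoming units.

```python
import numpy as np

def snn_init(fan_in, fan_out, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # weights ~ N(0, 1/fan_in), keeping activations in the self-normalizing regime
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_in, fan_out))
```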

It is well known that Dropout does not play well with Batch Normalization, since it perturbs network activations in a purely random manner, and this method seems even more brittle to the dropout effect. As a cure, they propose Alpha Dropout. It randomly sets inputs to the saturated negative value of SELU, which is \(-\alpha\lambda\). Then an affine transformation with values \(a\) and \(b\), computed from the dropout rate and the targeted mean and variance, is applied. This randomizes the network without degrading its self-normalizing properties.

(Code: https://gist.github.com/erogol/9eb0ab46366642887afeb78b4b8724d3)
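And a minimal NumPy sketch of Alpha Dropout; the affine correction follows the commonly used reference formulation for target mean 0 and variance 1, so treat the details as an assumption rather than a verbatim copy of the paper.

```python
import numpy as np

def alpha_dropout(x, rate=0.1, alpha=1.6732632423543772, lam=1.0507009873554805):
    alpha_p = -alpha * lam                      # saturated negative value of SELU
    keep = np.random.rand(*x.shape) >= rate     # dropout mask
    dropped = np.where(keep, x, alpha_p)
    # affine correction that restores mean 0 / variance 1 after dropping
    a = ((1.0 - rate) * (1.0 + rate * alpha_p ** 2)) ** -0.5
    b = -a * alpha_p * rate
    return a * dropped + b
```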

From a practical point of view, SELU seems promising by reducing the computation time relative to ReLU+BN for normalizing the network. In the paper, they do not provide any vision baselines such as MNIST or CIFAR; they only focus on fully-connected models. I am still curious to see its performance on these benchmarks against Batch Normalization. I plan to give it a shot in the near future.

One thing that nags me after reading the paper is the obsession with mean 0 and variance 1, not only in this paper but also in the other normalization techniques. Indeed, these values are just relative, so why 0 and 1 and not 0 and 4? If you have an answer to this, please ping me below.


Duplicate Question Detection with Deep Learning on Quora Dataset

Quora recently announced the first public dataset that they have ever released. It includes 404351 question pairs with a label column indicating whether they are duplicates or not. In this post, I would like to investigate this dataset and at least propose a baseline method with deep learning.

Besides the proposed method, it includes some examples showing how to use Pandas, Gensim, Spacy, and Keras. For the full code, check GitHub.

Data Quirks

There are 255045 negative (non-duplicate) and 149306 positive (duplicate) instances. This induces a class imbalance; however, when you consider the nature of the problem, it seems reasonable to keep the same data bias in your ML model, since negative instances are more expectable in a real-life scenario.

When we analyze the data, the shortest question is 1 character long (which is useless for the task) and the longest question is 1169 characters (a long, complicated love affair question). I see that if either question in a pair is shorter than 10 characters, the pair does not make sense, so I remove such pairs. The average length is 59 characters and the standard deviation is 32.

There are two other columns, “qid1” and “qid2”, but I really do not know how they are useful, since the same question used in different rows has different ids.

Some labels are not correct, especially among the duplicates. In any case, I decided to rely on the labels and defer any pruning, due to the hard manual effort it would require.

Proposed Method

Converting Questions into Vectors

Here, I plan to use Word2Vec to convert each question into a semantic vector, then stack a Siamese network on top to detect whether the pair is duplicate.

Word2Vec is a general term used for similar algorithms that embed words into a vector space, typically with 300 dimensions. These vectors capture semantics and even analogies between different words. The famous example is:

king - man + woman = queen.

Word2Vec vectors can be used for many useful applications. You can compute semantic word similarity, classify documents, or feed these vectors into Recurrent Neural Networks for more advanced applications.

There are two well-known algorithms in this domain. One is Google’s network architecture, which learns representations by trying to predict the surrounding words of a target word within a certain window size. GloVe is the other method; it relies on co-occurrence matrices. GloVe is easy to train and is flexible for adding new words outside of your vocabulary. You might like to visit this tutorial to learn more, and check out this brilliant use-case, Sense2Vec.

We still need a way to combine the word vectors into a single question representation. One simple alternative is taking the mean of all the word vectors of a question. This is a simple but really effective approach for document classification, and I expect it to work for this problem too. In addition, it is possible to enhance the mean vector representation by using TF-IDF scores defined for each word: we take a weighted average of the word vectors using these scores. This emphasizes discriminating words and plays down the useless, frequent words that are shared by many questions.

Siamese Network

I described Siamese networks in a previous post. In short, it is a two-way network architecture that takes two inputs, one on each side. It projects the data into a space in which similar items are contracted and dissimilar ones are dispersed. It is computationally efficient since the two branches share parameters.

Siamese network tries to contract instances belonging to the same classes and disperse instances from different classes in the feature space.

 

Implementation

Let’s load the training data first.
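Something along these lines; the file name and column names follow the released TSV, but treat them as assumptions.

```python
import pandas as pd

# hypothetical local path to the released dataset
df = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")
df = df.dropna(subset=["question1", "question2"])
# drop pairs where either question is shorter than 10 characters (see above)
df = df[(df.question1.str.len() >= 10) & (df.question2.str.len() >= 10)]
labels = df.is_duplicate.values
```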

For this particular problem, I train my own word vector model using Gensim.
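A minimal sketch of that training step. Note that Gensim ships a Word2Vec trainer rather than a GloVe one, and the hyper-parameters (and Gensim 4.x argument names) below are illustrative.

```python
from gensim.models import Word2Vec

# naive whitespace tokenization of both question columns (df comes from the loading step)
sentences = [q.lower().split() for q in df.question1.tolist() + df.question2.tolist()]

w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=2, workers=4)
w2v.save("quora_word_vectors.model")
```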

The code above trains the word vector model and saves it. It generates 300-dimensional vectors for words. The hyper-parameters could be chosen more carefully, but this is just a baseline to see an initial level of performance. However, as I will show, this model performs below my expectations. I believe this is because our questions are short and do not provide enough semantic structure for the model to learn salient vectors.

Due to the performance issue and the observation above, I decided to use the pre-trained GloVe vectors that come free with Spacy. They are trained on Wikipedia and are therefore stronger in terms of word semantics. This is how we use Spacy for this purpose.
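Roughly like this, assuming a spaCy model with word vectors is installed (e.g. `en_core_web_lg`; the exact model name depends on your spaCy version).

```python
import spacy

# requires: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

doc = nlp("How can I learn machine learning?")
question_vector = doc.vector  # 300-d mean of the token vectors
```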

Before going further: I really like Spacy. It is really fast, and it does everything you need for NLP in a flash while hiding many of the intrinsic details. It deserves a good remuneration. Similar to the Gensim model, it also provides 300-dimensional embedding vectors.

The results I get with the Spacy vectors are above those of the Gensim model I trained, so it is the better choice to go further with TF-IDF scoring. For TF-IDF, I used scikit-learn (the heaven of ML). It provides TfidfVectorizer, which does everything you need.

After we compute the TF-IDF scores, we convert each question into a weighted average of its word2vec vectors using these scores. The code below does this for just the “question1” column.
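A sketch of that weighting step, reusing the `df` and `nlp` objects from the snippets above. It uses the IDF part of TfidfVectorizer as a per-word weight, which is one reasonable reading of the scheme described earlier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(lowercase=True)
tfidf.fit(df.question1.tolist() + df.question2.tolist())
# per-word IDF weights (use get_feature_names() on older scikit-learn versions)
word_weights = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def weighted_vector(text, dim=300):
    tokens = [t for t in nlp(text) if t.has_vector]
    if not tokens:
        return np.zeros(dim)
    weights = np.array([word_weights.get(t.text.lower(), 1.0) for t in tokens])
    vectors = np.array([t.vector for t in tokens])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

q1_vectors = np.stack([weighted_vector(q) for q in df.question1])
```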

Now we are ready to create the training data for the Siamese network. Basically, I just fetch the labels and convert the mean word2vec vectors to numpy format. I also split the data into a train and a test set.

At this stage, we need to define the Siamese network structure. I use Keras for its simplicity. The model definition is sketched below.
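Here is a condensed sketch of such a model in modern Keras. It follows the description below — a shared 3-layer encoder with Batch Normalization and a Euclidean-distance head — but omits the residual connections and uses illustrative layer sizes.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(input_dim=300, hidden=128):
    inp = layers.Input(shape=(input_dim,))
    x = inp
    for _ in range(3):                      # three Dense -> BN -> ReLU blocks
        x = layers.Dense(hidden)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
    return Model(inp, x)

def euclidean_distance(tensors):
    a, b = tensors
    return tf.sqrt(tf.reduce_sum(tf.square(a - b), axis=1, keepdims=True) + 1e-8)

encoder = build_encoder()
q1_in = layers.Input(shape=(300,))
q2_in = layers.Input(shape=(300,))
distance = layers.Lambda(euclidean_distance)([encoder(q1_in), encoder(q2_in)])
output = layers.Dense(1, activation="sigmoid")(distance)

siamese = Model([q1_in, q2_in], output)
siamese.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```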

The best performing network uses residual connections. It is a 3-layer network using Euclidean distance as the measure of instance similarity, with Batch Normalization after each layer. This is particularly important, since the BN layers improve the performance considerably. I believe they normalize the final feature vectors, and Euclidean distance performs better in this normalized space.

I tried cosine distance, which is theoretically more concordant with Word2Vec vectors, but I could not obtain better results with it. I also tried normalizing the data to unit variance or to unit L2 norm, but nothing gave better results than the original feature values.

Let’s train the network with the prepared data. I used the same model and hyper-parameters for all configurations. It is always possible to optimize these further, but so far I am able to give promising baseline results.

Results

In this section, I would like to share the test set accuracy values obtained with different model and feature extraction settings. We expect to see an improvement over 0.63, since that is the accuracy we get when we predict all labels as 0.

These are the best results I obtained with the different word vector models. They all use the same network and hyper-parameters, chosen after finding the best setup in the last configuration depicted below.

  • Gensim (my model) + Siamese: 0.69
  • Spacy + Siamese :  0.72
  • Spacy + TF-IDF + Siamese : 0.79

We can also investigate the effect of different model architectures. These are the values obtained with the best word2vec setting shown above.

  • 2 layers net : 0.67
  • 3 layers net + adam : 0.74
  • 3 layers resnet (after relu BN) + adam : 0.77
  • 3 layers resnet (before relu BN) + adam : 0.78
  • 3 layers resnet (before relu BN) + adam + dropout : 0.75
  • 3 layers resnet (before relu BN) + adam + layer concat : 0.79
  • 3 layers resnet (before relu BN) + adam + unit_norm + cosine_distance : Fail

Adam works quite well for this problem compared to SGD with learning rate scheduling. Batch Normalization also yields a good improvement. I tried introducing Dropout between the layers in different positions (before ReLU, after BN, etc.); the best I obtained is 0.75. Concatenation of different layers improves the performance by 1 percent as the final gain.

In conclusion, I have tried to present a solution to this unique problem by composing different aspects of deep learning. We start with Word2Vec, combine it with TF-IDF, and then use a Siamese network to find duplicates. The results are not perfect and are open to further optimization. However, it is just a small attempt to see the power of deep learning in this domain. I hope you find it useful :).

Updates

  • Switching the last layer to an FC layer improves performance to 0.84.
  • Using a bidirectional RNN and 1D convolutional layers together as feature extractors improves performance to 0.91. Maybe I will explain the details in another post.

Short guide to deploy Machine Learning

“Calling ML intricately simple 🙂 “

Suppose you have a problem that you would like to tackle with machine learning and use the resulting system in a real-life project. I would like to share my simple pathway for this purpose, in order to provide a basic guide for beginners and to keep these things as a reminder for myself. These rules are tricky: even though they are simple, it is not trivial to remember them all and to suppress your instinct, which wants to see a running model as soon as possible.

When we confront a problem, we initially have numerous learning algorithms, anywhere from a few bytes to gigabytes of data, and established knowledge about applying some of these models to particular problems. With all of this in mind, we follow a three-stage procedure;

  1. Define a goal based on a metric
  2. Build the system
  3. Refine the system with more data

Let’s pare down these steps into more detail; Continue Reading


Paper review – Understanding Deep Learning Requires Rethinking Generalization

Paper: https://arxiv.org/pdf/1611.03530v1.pdf

This paper makes the following claim: traditional machine learning frameworks (VC dimension, Rademacher complexity, etc.) that try to explain how learning occurs are not very good at explaining the success of deep learning models, and we need more understanding from different perspectives.

They rely on the following empirical observations;

  • Deep networks are able to fit any kind of training data, even white-noise instances with random labels. This implies that neural networks have a very good brute-force memorization capacity.
  • Explicit regularization techniques – dropout, weight decay, batch norm – improve model generalization, but that does not mean the same network gives poor generalization performance without any of them. For instance, an Inception network trained without any explicit technique reaches an 80.38% top-5 rate, whereas the same network achieved 83.6% on the ImageNet challenge with explicit techniques.
  • A 2-layer network with 2n+d parameters can represent any function on n samples in d dimensions. They provide a proof of this statement in the appendix. From the empirical standpoint, they show the network's performance on the MNIST and CIFAR-10 datasets with a 2-layer Multi-Layer Perceptron.

The above observations raise the following questions and conflicts;

  • The traditional notion of learning suggests stronger regularization as we use more powerful models. However, a large enough network can memorize any kind of data, even pure random noise, and yet, without any further explicit regularization, these models still generalize well on natural datasets. This shows us that, contrary to the general belief, brute-force memorization can still go along with reasonable generalization performance at test time.
  • Classical approaches are poorly suited to explain the success of neural networks, and more investigation is needed to understand what is really going on from a theoretical view.
  • The generalization power of these networks is not really determined by the explicit techniques; instead, implicit factors like the learning method or the model architecture seem more effective.
  • The explanation of generalization needs to be revisited in order to resolve the conflicts depicted above.

My take: These large models are able to learn any function (and large does not mean deep anymore), and if there is any kind of informative overlap between the training data and the test data, they are able to generalize well too. Maybe one explanation is to think of these models as an ensemble of many millions of smaller models, gated by the zeroing effect of the activation functions. Thus, the network is able to memorize any function due to its size and implied capacity, but it still generalizes well thanks to this ensembling effect.


Why do we need better word representations ?

A successful AI agent should communicate. It is all about language. It should understand and explain itself in words in order to communicate with us. All of this starts with the “meaning” of words, which is the atomic part of human communication. This is one of the fundamental problems of Natural Language Processing (NLP).

“Meaning” is described as “the idea that is represented by a word, phrase, etc.” How about representing the meaning of a word in a computer? The first attempt is to use some kind of hand-curated taxonomy such as WordNet. However, such hand-made structures are not flexible enough, need human labor to elaborate, and do not capture semantic relations between words other than the carved rules. That is not what we expect from a real AI agent.

Then NLP research turned to using numeric vectors to represent words. The first approach is to denote words with discrete (one-hot) representations. That is, if we assume a vocabulary of 1K words, then we create a length-1K zero vector with a single 1 marking the target word. Continue Reading

