[[ the following text has been moved and duplicated from
models: machine learning (AI - artificial intelligence) (modelsreadingroom.blogspot.com)
]]
Kai-Fu Lee., AI superpowers: China, Silicon Valley and the new world order, 2018
pp.6-10
A brief history of deep learning
Machine learning ── the umbrella term for the field that includes deep learning ── is a history-altering technology, but one that is lucky to have survived a tumultuous half-century of research. Ever since its inception, artificial intelligence has undergone a number of boom-and-bust cycles. Periods of great promise have been followed by “AI winters”, when a disappointing lack of practical results led to major cuts in funding. Understanding what makes the arrival of deep learning different requires a quick recap of how we got here.
Back in the mid-1950s, the pioneers of artificial intelligence set themselves an impossibly lofty but well-defined mission: to recreate human intelligence in a machine. That striking combination of the clarity of the goal and the complexity of the task would draw in some of the greatest minds in the emerging field of computer science: Marvin Minsky, John McCarthy, and Herbert Simon.
As a wide-eyed computer science undergrad at Columbia University in the early 1980s, all of this seized my imagination. I was born in Taiwan in the early 1960s but moved to Tennessee at the age of 11 and finished middle and high school there. After four years at Columbia in New York, I knew that I wanted to dig deeper into AI. When applying for computer science Ph.D. programs in 1983, I even wrote this somewhat grandiose description of the field in my statement of purpose: “Artificial intelligence is the elucidation of the human learning process, the quantification of the human thinking process, the explication of human behavior, and the understanding of what makes intelligence possible. It is men's final step to understand themselves, and I hope to take part in this new, but promising science.”
That essay helped me get into the top-ranked computer science department of Carnegie Mellon University, a hotbed for cutting-edge AI research. It also displayed my naïveté about the field, both overestimating our power to understand ourselves and underestimating the power of AI to produce superhuman intelligence in narrow spheres.
By the time I began my Ph.D., the field of artificial intelligence had forked into two camps: the “rule-based” approach and the “neural networks” approach. Researchers in the rule-based camp (also sometimes called “symbolic systems” or “expert systems”) attempted to teach computers to think by encoding a series of logical rules: If X, then Y. This approach worked well for simple and well-defined games (“toy problems”) but fell apart when the universe of possible choices or moves expanded. To make the software more applicable to real-world problems, the rule-based camp tried interviewing experts in the problems being tackled and then encoding their wisdom into the program's decision-making (hence the “expert systems” moniker).
The “neural networks” camp, however, took a different approach. Instead of trying to teach the computer the rules that had been mastered by a human brain, these practitioners tried to reconstruct the human brain itself. Given that the tangled webs of neurons in animal brains were the only thing capable of intelligence as we knew it, the researchers figured they'd go straight to the source. This approach mimics the brain's underlying architecture, constructing layers of artificial neurons that can receive and transmit information in a structure akin to our networks of biological neurons. Unlike the rule-based approach, builders of neural networks generally do not give the networks rules to follow in making decisions. They simply feed lots and lots of examples of a given phenomenon ── pictures, chess games, sounds ── into the neural networks and let the networks themselves identify patterns within the data. In other words, the less human interference, the better.
Differences between the two approaches can be seen in how they might tackle a simple problem: identifying whether there is a cat in a picture. The rule-based approach would attempt to lay down “if-then” rules to help the program make a decision: “If there are two triangular shapes on top of a circular shape, then there is probably a cat in the picture”. The neural network approach would instead feed the program millions of sample photos labeled “cat” or “no cat”, letting the program figure out for itself what features in the millions of images were most closely correlated to the “cat” label.
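([ my own toy sketch in Python, not from the book, of the contrast being described: the rule-based detector encodes a hand-written if-then rule like the one above, while the “learned” detector is given nothing but labeled examples; a nearest-neighbour lookup stands in here for a real neural network ])
```python
# Rule-based camp: a human writes the rule.
def rule_based_cat_detector(shapes):
    # "If there are two triangular shapes on top of a circular shape..."
    if shapes.count("triangle") >= 2 and "circle" in shapes:
        return "cat"
    return "no cat"

# Learning camp: no rules, only labeled examples (toy feature vectors here).
def learned_cat_detector(train_features, train_labels, new_features):
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # answer with the label of the most similar training example
    best = min(range(len(train_features)),
               key=lambda i: distance(train_features[i], new_features))
    return train_labels[best]

print(rule_based_cat_detector(["triangle", "triangle", "circle"]))                      # cat
print(learned_cat_detector([[0.9, 0.1], [0.1, 0.8]], ["cat", "no cat"], [0.85, 0.2]))   # cat
```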
During the 1950s and 1960s, early versions of artificial neural networks yielded promising results and plenty of hype. But then in 1969, researchers from the rule-based camp pushed back, convincing many in the field that neural networks were unreliable and limited in their use. The neural networks approach quickly went out of fashion, and AI plunged into one of its first “winters” during the 1970s.
Over the subsequent decades, neural networks enjoyed brief stints of prominence, followed by near-total abandonment. In 1988, I used a technique akin to neural networks (Hidden Markov Models) to create Sphinx, the world's first speaker-independent program for recognizing continuous speech. That achievement landed me a profile in the New York Times. But it wasn't enough to save neural networks from once again falling out of favor, as AI reentered a prolonged ice age for most of the 1990s.
What ultimately resuscitated the field of neural networks ── and sparked the AI renaissance we are living through today ── were changes to two of the key raw ingredients that neural networks feed on, along with one major technical breakthrough. Neural networks require large amounts of two things: computing power and data. The data “trains” the program to recognize patterns by giving it many examples, and the computing power lets the program parse those examples at high speeds.
Both data and computing power were in short supply at the dawn of the field in the 1950s. But in the intervening decades, all that has changed. Today, your smartphone holds millions of times more processing power than the leading cutting-edge computers that NASA used to send Neil Armstrong to the moon in 1969. And the internet has led to an explosion of all kinds of digital data: text, images, videos, clicks, purchases, Tweets, and so on. Taken together, all of this has given researchers copious amounts of rich data on which to train their networks, as well as plenty of cheap computing power for that training.
But the networks themselves were still severely limited in what they could do. Accurate results to complex problems required many layers of artificial neurons, but researchers hadn't found a way to efficiently train those layers as they were added. Deep learning's big technical break finally arrived in the mid-2000s, when leading researcher Geoffrey Hinton discovered a way to efficiently train those new layers in neural networks. The result was like giving steroids to the old neural networks, multiplying their power to perform tasks such as speech and object recognition.
Soon, these juiced-up neural networks ── now rebranded as “deep learning” ── could outperform older models at a variety of tasks. But years of ingrained prejudice against the neural networks approach led many AI researchers to overlook this “fringe” group that claimed outstanding results. The turning point came in 2012, when a neural network built by Hinton's team demolished the competition in an international computer vision contest.
After decades spent on the margins of AI research, neural networks hit the mainstream overnight, this time in the form of deep learning. That breakthrough promised to thaw the ice from the latest AI winter, and for the first time truly bring AI's power to bear on a range of real-world problems. Researchers, futurists, and tech CEOs all began buzzing about the massive potential of the field to decipher human speech, translate documents, recognize images, predict consumer behavior, identify fraud, make lending decisions, help robots “see”, and even drive a car.
p.10
So how does deep learning do this? Fundamentally, these algorithms use massive amounts of data from a specific domain to make a decision that optimizes for a desired outcome. It does this by training itself to recognize deeply buried patterns and correlations connecting many data points to the desired outcome. This pattern-finding process is easier when the data is labeled with that desired outcome ─ “cat” versus “no cat”; “clicked” versus “didn't click”; “won game” versus “lost game”. It can then draw on its extensive knowledge of these correlations ─ many of which are invisible or irrelevant to human observers ─ to make better decisions than a human could.
Doing this requires massive amounts of relevant data, a strong algorithm, a narrow domain, and a concrete goal. If you're short any one of these, things fall apart. Too little data? The algorithm doesn't have enough examples to uncover meaningful correlations. Too broad a goal? The algorithm lacks clear benchmarks to shoot for in optimization.
Deep learning is what's known as “narrow AI” ─ intelligence that takes data from one specific domain and applies it to optimizing one specific outcome. While impressive, it is still a far cry from “general AI”, the all-purpose technology that can do everything a human can.
Deep learning's most natural application is in fields like insurance and making loans. Relevant data on borrowers is abundant (credit score, income, recent credit-card usage), and the goal to optimize for is clear (minimize default rates).
pp.10-11
Taken one step further, deep learning will power self-driving cars by helping them to “see” the world around them ─ recognize patterns in the camera's pixels (red octagons), figure out what they correlate to (stop signs), and use that information to make decisions (apply pressure to the brake to slowly stop) that optimize for your desired outcome (deliver me safely home in minimal time).
p.11
deep learning
to recognize a pattern,
optimize for a specific outcome,
make a decision
can be applied to so many different kinds of everyday problems.
p.11
People are so excited about deep learning precisely because its core power ─ its ability to recognize a pattern, optimize for a specific outcome, make a decision ─ can be applied to so many different kinds of everyday problems.
p.110
the fact that internet users are automatically labeling data as they browse.
p.110
traditional companies have also been automatically labeling huge quantities of data for decades. For instance, insurance companies have been covering accidents and catching fraud, banks have been issuing loans and documenting repayment rates, and hospitals have been keeping records of diagnoses and survival rates.
p.110
Business AI mines these databases for hidden correlations that often escape the naked eye and human brain.
p.110
historic decisions and outcomes within an organization and
uses labeled data to train an algorithm that can outperform even the most experienced human practitioners.
p.110
strong features
Humans normally make predictions on the basis of strong features, a handful of data points that are highly correlated to a specific outcome, often in a clear cause-and-effect relationship. For example, in predicting the likelihood of someone contracting diabetes, a person's weight and body mass index are strong features.
p.111
weak features
Weak features: peripheral data points that might appear unrelated to the outcome but contain some predictive power when combined across tens of millions of examples.
These subtle correlations are often impossible for any human to explain in terms of cause and effect: why do borrowers who take out loans on Wednesday repay those loans faster?
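([ an illustrative Python sketch, mine rather than the book's: one made-up “strong” feature carries most of the signal by itself, while the made-up “weak” features are individually near-useless and only add predictive power when hundreds of them are combined over many examples ])
```python
import numpy as np

rng = np.random.default_rng(0)
n, n_weak = 20000, 200
strong = rng.normal(size=(n, 1))                  # e.g. body-mass index
weak = rng.normal(size=(n, n_weak))               # peripheral data points
# outcome driven mostly by the strong feature, faintly by every weak one
logits = 2.0 * strong[:, 0] + 0.05 * weak.sum(axis=1)
y = (logits + rng.normal(size=n) > 0).astype(int)

def accuracy(X, y):
    """Fit a least-squares linear score and check how often its sign matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # add a bias column
    w, *_ = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)
    return np.mean((Xb @ w > 0) == y)

print("strong feature only:", round(accuracy(strong, y), 3))
print("weak features only :", round(accuracy(weak, y), 3))
print("strong + weak      :", round(accuracy(np.hstack([strong, weak]), y), 3))
```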
p.111
Optimizations like this work well in industries with large amounts of structured data on meaningful business outcomes. In this case, “structured” refers to data that has been categorized, labeled, and made searchable. Prime examples of well-structured corporate data sets include historic stock prices, credit-card usage, and mortgage defaults.
(AI superpowers: China, Silicon Valley and the new world order / Kai-Fu Lee.;
Boston: Houghton Mifflin Harcourt, 2018; includes bibliographical references and index; subjects: artificial intelligence ── economic aspects ── China. | artificial intelligence ── economic aspects ── United States.; HC79.I55 (ebook)
HC79.I55 L435 2018 (print); 338.4; https://lccn.loc.gov/2018-17250; 2018, )
____________________________________
• I believed the technology [AI speech recognition] would go mainstream within five years.
• It turned out that I was off by twenty years.
Kai-Fu Lee., AI superpowers: China, Silicon Valley and the new world order, 2018
p.143
In the late 1980s, I was the world's leading researcher on AI speech recognition, and I joined Apple because I believed the technology would go mainstream within five years. It turned out that I was off by twenty years.
p.178, p.177
chief scientist for speech recognition, 1991
we used voice commands to schedule an appointment, write a check, and program a VCR,
showcasing the earliest examples of futuristic functions that wouldn't go mainstream for another 20 years, with Apple's Siri and Amazon's Alexa.
(AI superpowers: China, Silicon Valley and the new world order / Kai-Fu Lee.; Boston: Houghton Mifflin Harcourt, 2018; includes bibliographical references and index; subjects: artificial intelligence ── economic aspects ── China. | artificial intelligence ── economic aspects ── United States.; HC79.I55 (ebook)
HC79.I55 L435 2018 (print); 338.4; https://lccn.loc.gov/2018-17250; 2018, )
____________________________________
Stephen Witt., The chosen chip : how Nvidia is powering the A.I. revolution., The New Yorker., Dec. 4, 2023
Jensen Huang, Nvidia's c.e.o.,
RIVA 128
GeForce
PC gamers, looking to gain an edge, bought new GeForce cards every time they were upgraded.
p.30
He founded Nvidia in 1993, with Chris Malachowsky and Curtis Priem, two veteran microchip designers.
p.30
Malachowsky and Priem were looking to design a graphics chip, which they hoped would make competitors, in Priem's words, “green with envy.”
p.31
In 2000, Ian Buck, a graduate student studying computer graphics at Stanford, chained 32 GeForce cards together to play Quake using 8 projectors. “It was the first gaming rig in 8K resolution, and it took up an entire wall,” Buck told me. “It was beautiful.”
p.31
Buck wondered if the GeForce cards might be useful for tasks other than launching grenades at his friends. The cards came with a primitive programming tool called a shader. With a grant from DARPA, the Department of Defense's research arm, Buck hacked the shaders to access the parallel-computing circuits below, repurposing the GeForce into a low-budget supercomputer. Soon, Buck was working for Huang.
p.32
Since 2004, Buck has overseen the development of Nvidia's supercomputing software package, known as CUDA. Huang's vision was to enable CUDA to work on every GeForce card.
p.32
As Buck developed the software, Nvidia's hardware team began allocating space on the microchips for supercomputing operations. The chips contained billions of electronic transistors, which routed electricity through labyrinthine circuits to complete calculations at extraordinary speed. Arjun Prabhu, Nvidia's lead chip engineer, compared microchip design to urban planning, with different regions devoted to different tasks. As Tetris players do with falling blocks, Prabhu will sometimes see transistors in his sleep. “I've often had it where the best ideas happen on a Friday night, when I'm literally dreaming about it,” Prabhu said.
p.32
When CUDA was released, in late 2006, Wall Street reacted with dismay.
p.32
“They were spending a fortune on this new chip architecture,” Ben Gilbert, the co-host of “Acquired,” a popular Silicon Valley podcast, said. “They were spending many billions targeting an obscure corner of academic and scientific computing, which was not a large market at the time ── certainly less than the billions they were pouring in.”
p.32
Huang argued that the simple existence of CUDA would enlarge the supercomputing sector. This view was not widely held, and by the end of 2008 Nvidia's stock price had declined by 70 per cent. ([ this would be the time to buy, knowing what we know now ])
p.32
Ting-Wai Chiu, a professor of physics at National Taiwan University,
had constructed a homemade supercomputer in a laboratory adjacent to his office.
Huang arrived to find the lab littered with GeForce boxes and the computer cooled by oscillating desk fans. “Jensen is a visionary,” Chiu told me. “He made my life's work possible.”
Chiu was the model customer, but there weren't many like him.
p.33
Downloads of CUDA hit a peak in 2009, then declined for three years. Board members worried that Nvidia's depressed stock price would make it a target for corporate raiders. “We did everything we could to protect the company against an activist shareholder who might come in and try to break it up”, Jim Gaither, a longtime board member, told me.
([ this would be the time to buy, knowing what we know now ])
p.33
Dawn Hudson, a former N.F.L. marketing executive, joined the board in 2013. “It was a distinctly flat, stagnant company”, she said.
p.33
In marketing CUDA,
pp.33─34
p.33
One application that Nvidia spent little time thinking about was artificial intelligence. There didn't seem to be much of a market.
At the beginning of the 2010s, A.I. was a neglected discipline. Basic tasks such as image recognition and speech recognition had seen only halting progress. Within this unpopular academic field, an even less popular subfield solved problems using “neural networks” ── computing structures inspired by the human brain. Many computer scientists considered neural networks to be discredited. “I was discouraged by my advisers from working on neural nets”, Catanzaro, the deep-learning researcher, told me, “because, at the time, they were considered to be outdated, and they didn't work.”
pp.33─34
Catanzaro, the deep-learning researcher
Catanzaro described the researchers who continued to work on neural nets as “prophets in the wilderness.” One of those prophets was Geoffrey Hinton, a professor at the University of Toronto. In 2009, Hinton's research group used Nvidia's CUDA platform to train a neural network to recognize human speech. He was surprised by the quality of the results, which he presented at a conference later that year. He then reached out to Nvidia. “I sent an e-mail saying, ‘Look, I just told a thousand machine-learning researchers they should go and buy Nvidia cards. Can you send me a free one?’” Hinton told me. “They said no.”
Despite the snub, Hinton encouraged his students to use CUDA, including a Ukrainian-born protege of his named Alex Krizhevsky, who Hinton thought was perhaps the finest programmer he'd ever met. In 2012, Krizhevsky and his research partner, Ilya Sutskever, working on a tight budget, bought two GeForce cards from Amazon. Krizhevsky then began training a visual-recognition neural network on Nvidia's parallel-computing platform, feeding it millions of images in a single week. “He had the two G.P.U. boards whirring in his bedroom,” Hinton said. “Actually, it was his parents who paid for the quite considerable electricity costs.”
Sutskever and Krizhevsky were astonished by the cards' capabilities. Earlier that year, researchers at Google had trained a neural net that identified videos of cats, an effort that required some 16,000 C.P.U.s.
Sutskever and Krizhevsky had produced world-class results with just two Nvidia circuit boards. “G.P.U.s showed up and it felt like a miracle,” Sutskever told me.
AlexNet, the neural network that Krizhevsky trained in his parents' house, can now be mentioned alongside the Wright flyer ([ the Wright brothers' heavier-than-air flying machine; a traditional object heavier than air cannot float, rise, or fly; a balloon can float because it contains enough hot air to be lighter than the surrounding atmosphere; an aircraft with wings can fly by moving the wing through the air at a fast enough speed (or with a strong enough wind), which produces lift (flight, flying) ]) and the Edison bulb. In 2012, Krizhevsky entered AlexNet into the annual ImageNet visual-recognition contest; neural networks were unpopular enough at the time that he was the only contestant to use this technique ([ what were the other techniques? ]). AlexNet scored so well in the competition that the organizers initially wondered if Krizhevsky had somehow cheated. “That was a kind of Big Bang moment”, Hinton said. “That was the paradigm shift.”
In the decade since Krizhevsky's nine-page description of AlexNet's architecture was published, it has been cited more than a hundred thousand times, making it one of the most important papers in the history of computer science. (AlexNet correctly identified photographs of a scooter, a leopard, and a container ship, among other things.)
Krizhevsky pioneered a number of important programming techniques, but his key finding was that a specialized G.P.U. could train neural networks up to a hundred times faster than a general-purpose C.P.U. “To do machine learning without CUDA would have just been too much trouble,” Hinton said.
Within a couple of years, every entrant in the ImageNet competition was using a neural network. Before long, neural networks trained on G.P.U.s were identifying images with 96 per cent accuracy, surpassing humans.
p.34
“The fact that they can solve computer vision, which is completely unstructured, leads to the question ‘What else can you teach it?’” Huang said to me.
The answer seemed to be: everything.
p.34
Huang concluded that neural networks would revolutionize society, and that he could use CUDA to corner the market on the necessary hardware.
“He sent out an e-mail on Friday evening saying everything is going to deep learning, and that we were no longer a graphics company,” Greg Estes, a vice-president at Nvidia, told me. “By Monday morning, we were an A.I. company. Literally, it was that fast.”
p.35
In 2017, Google introduced a new architecture for neural-net training called the transformer. The following year, OpenAI used Google's framework to build the first “generative pre-trained transformer”, or G.P.T.
The G.P.T. models were trained on Nvidia supercomputers, absorbing an enormous corpus of text and learning how to make humanlike connections.
(The new yorker, Dec. 4, 2023, brave new world dept., The chosen chip : how nvidia is powering the A.I. revolution., By Stephen Witt., pp.28─37, )
____________________________________
• (2006) Hinton and colleagues' landmark paper: Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh, “A fast learning algorithm for deep belief nets”, Neural Computation 18 (2006)
• (2006) CUDA was released, in late 2006, p.32, (The new yorker, Dec. 4, 2023, brave new world dept., The chosen chip : how nvidia is powering the A.I. revolution., By Stephen Witt., pp.28─37, )
• (2009) Downloads of CUDA hit a peak of 2009, then declined for three years., p.33, (The new yorker, Dec. 4, 2023, brave new world dept., The chosen chip : how nvidia is powering the A.I. revolution., By Stephen Witt., pp.28─37, )
• (2012) In 2012, Krizhevsky entered AlexNet into the annual ImageNet visual-recognition contest; pp.33─34, (The new yorker, Dec. 4, 2023, brave new world dept., The chosen chip : how nvidia is powering the A.I. revolution., By Stephen Witt., pp.28─37, )
• (2012) A neural network built by Hinton's team demolished the competition in an international computer vision contest.
Kai-Fu Lee., AI superpowers: China, Silicon Valley and the new world order, 2018
p.238
Hinton and colleagues' landmark paper:
Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh,
“A fast learning algorithm for deep belief nets”,
Neural Computation 18 (2006): 1527-1554.
p.9
Soon, these neural networks ─ now rebranded as “deep learning” ─ could outperform older models at a variety of tasks.
p.9
The turning point came in 2012, when a neural network built by Hinton's team demolished the competition in an international computer vision contest.
(AI superpowers: China, Silicon Valley and the new world order / Kai-Fu Lee.; Boston: Houghton Mifflin Harcourt, 2018; includes bibliographical references and index; subjects: artificial intelligence ── economic aspects ── China. | artificial intelligence ── economic aspects ── United States.; HC79.I55 (ebook)
HC79.I55 L435 2018 (print); 338.4; https://lccn.loc.gov/2018-17250; 2018, )
____________________________________
A Fireside Chat with Turing Award Winner Geoffrey Hinton, Pioneer of Deep Learning (Google I/O'19)
https://youtu.be/UTfQwTuri8Y
https://youtu.be/UTfQwTuri8Y
39:01
TensorFlow
May 9, 2019
it was 40 years ago,
it seems to me there is no other way the brain could work,
it has to work by learning the strengths of connections.
And, if you want to make a device do something intelligent,
you've got two options.
you can program it, or it can learn.
And we certainly weren't programmed.
So we had to learn.
So this had to be the right way to go.
Neural networks explained
----------------------
so you have relatively simple processing
elements that are very loosely models of neurons.
They have connections coming in.
Each connection has a weight on it.
That weight can be changed to do learning.
And what a neuron does is take the activities
on the connections times the weights, adds them all up,
and then decides whether to send an output.
And if it gets a big enough sum, it sends an output.
If the sum is negative, it doesn't send anything.
That's about it.
it's just a question of how you change the weights.
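([ a bare-bones Python version of the neuron being described; my sketch, not Hinton's code: multiply the incoming activities by their weights, add them up, send an output only if the sum is big enough, and do the "learning" by nudging the weights ])
```python
def neuron(activities, weights, threshold=0.0):
    # weighted sum of the incoming activities
    total = sum(a * w for a, w in zip(activities, weights))
    return 1 if total > threshold else 0          # big enough sum -> output, else nothing

def nudge_weights(activities, weights, target, rate=0.1):
    # one crude perceptron-style weight change toward the desired output
    out = neuron(activities, weights)
    return [w + rate * (target - out) * a for w, a in zip(weights, activities)]

weights = [0.2, -0.4, -0.1]
print(neuron([1.0, 0.5, 1.0], weights))           # 0: the sum is too small, no output
weights = nudge_weights([1.0, 0.5, 1.0], weights, target=1)
print(neuron([1.0, 0.5, 1.0], weights))           # 1: after changing the weights, it fires
```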
it was designed to be like how the brain works.
the whole idea was to have a learning device that
learned like the brain, like people think the brain learns
by changing the connection strengths.
this wasn't my idea
Turing,
he believed that the brain was this unorganized device
with random weights.
And it would use reinforcement learning
to change the connections.
And it would learn everything, and
he thought that was the best route to intelligence.
it wasn't just Turing. Lots of people thought that back then.
it turns out that it was mainly a question of scale
just by trying to model the structure of the data
I actually still believe that.
You can say this model finds the data less surprising than this.
Then around the same time, they started developing the GPUs.
The people doing neural networks started using GPUs in about 2007.
And so they were using this idea of pre-training.
after they'd done the pre-training, then they'd just stick labels
on top and use back propagation.
And it turned out that way, you could have a very deep net
that was pre-trained this way.
you can use back propagation and it actually works
since it was beating standard models that
had taken 30 years to develop, with a bit more development
it would do really well.
Google was the fastest to turn it into a production speech recognizer.
And by 2012, that work (which was first done in 2009)
came out in Android.
And Android suddenly got better speech recognition
it felt really good that it got state of the art on a real problem
George Dahl,
this stuff is going to work for image recognition
Fei-Fei Li has created the correct data set for it,
And so what we did was take an approach originally developed
by Yann LeCun.
A student called Alex Krizhevsky was a real wizard.
He could make GPUs do anything.
Programmed the GPUs really, really well.
And we got results that were a lot better
than standard computer vision.
that was 2012.
Kaggle, modeling chemical molecule, predictor of molecule binding
if you told me in 2012 that in the next five years,
we'll be able to translate between many languages using
just the same technology, recurrent nets,
but just the stochastic gradient descent
from random initial weights,
I wouldn't have believed you.
It happened much faster than expected.
So I think what we've learned in the last 10
years is that if you take a system with billions
of parameters, and you'd use stochastic gradient descent
in some objective function,
and the objective function might be to get the right labels
or it might be to fill in the gap in a string of words,
or any objective function, it works much better than it
has any right to.
it works much better than you would expect.
You would have thought, and most people in conventional AI
thought, take a system with a billion parameters,
start them off with random values,
measure the gradient of the objective function.
That is, for each parameter figure out how the objective function
would change if you change that parameter a little bit.
And then change it in that direction that improves
the objective function.
You would have thought that would be a kind of hopeless
algorithm that would get stuck.
And it turns out, it's a really good algorithm.
And the bigger you scale things, the better it works.
And that's just an empirical discovery really.
There's some theory coming along,
but it's basically an empirical discovery.
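([ a tiny stochastic-gradient-descent loop in Python to make the procedure concrete; my illustration with three parameters instead of billions, a made-up squared-error objective, and a numerical gradient in place of back propagation ])
```python
import random

def objective(params, example):
    """Squared error of a linear predictor on one (inputs, label) example."""
    inputs, label = example
    prediction = sum(p * x for p, x in zip(params, inputs))
    return (prediction - label) ** 2

def gradient(params, example, eps=1e-5):
    """For each parameter, measure how the objective changes if you move it a little."""
    grads = []
    for i in range(len(params)):
        bumped = params[:i] + [params[i] + eps] + params[i + 1:]
        grads.append((objective(bumped, example) - objective(params, example)) / eps)
    return grads

data = [([1.0, x, x * x], 3.0 + 2.0 * x) for x in [i / 10 for i in range(-20, 20)]]
params = [random.uniform(-1, 1) for _ in range(3)]        # start from random values
for step in range(5000):
    example = random.choice(data)                          # "stochastic": one example at a time
    g = gradient(params, example)
    params = [p - 0.01 * gi for p, gi in zip(params, g)]   # step in the direction that reduces the error
print([round(p, 2) for p in params])                       # should drift toward roughly [3.0, 2.0, 0.0]
```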
Now because we've discovered that,
it makes it far more plausible that the brain
is computing the gradient of some objective function
and updating the weights of strengths of synapses
to follow that gradient.
And we know now that's wrong.
You can just put in random parameters and learn everything.
One theory of dreaming (unlearning)
Boltzmann machine learning algorithm
And the Boltzmann machine learning algorithm
had a very interesting property, which is I show you data.
That is, I fixed the states of the observable units.
And it sort of rattles around the other units
until it's got a fairly happy state.
And once it's done that, it increases
the strength of all the connections based
on the rule that if two units are both active, it increases the connection strength.
That's called kind of Hebbian learning.
But if you just do that, the connection strengths
just get bigger and bigger.
You also have to have a phase where you cut it off from the input.
You let it rattle around to settle into a state it's happy with.
So now it's having a fantasy.
And once it's had the fantasy you say,
take all pairs of neurons that are both active
and decrease the strength of the connection.
So I'm explaining the algorithm to you just as a procedure.
But actually that algorithm is the result of doing some math
and saying, how should you change these connection
strengths so that this neural network with all
these hidden units finds the data unsurprising?
And it has to have this other phase.
It has to have this what we call the negative phase when
it's running with no input.
And it's cancelling out - it's unlearning whatever state it settles into.
Terry Sejnowski and I showed that actually that
is a maximum ... learning procedure for Boltzmann machines.
so that's one theory of dreaming
Yeah, we show theoretically
that's the right thing to do if you want
to change the weights so that your big neural network finds
the observed data less surprising.
So yes, we had machine learning algorithms.
Some of the first algorithms that
could learn what to do with hidden units
were Boltzmann machines.
Those were the things that learned one layer of feature
detectors at a time.
And it was an efficient form of restricted Boltzmann machine.
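([ a sketch of that two-phase rule for a restricted Boltzmann machine, in Python; my simplification (a contrastive-divergence flavour on random placeholder data), not Hinton's code: a "wake" phase with the data clamped increases connection strengths Hebbian-style, and a "dream" phase lets the network produce a fantasy and then unlearns it ])
```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))     # connection strengths

def sample_hidden(v):
    p = 1.0 / (1.0 + np.exp(-(v @ W)))                     # probability each hidden unit turns on
    return (rng.random(n_hidden) < p).astype(float), p

def sample_visible(h):
    p = 1.0 / (1.0 + np.exp(-(W @ h)))
    return (rng.random(n_visible) < p).astype(float)

data = rng.integers(0, 2, size=(100, n_visible)).astype(float)   # placeholder "observed" data
for v_data in data:
    h_data, p_data = sample_hidden(v_data)                 # positive phase: data clamped
    v_fantasy = sample_visible(h_data)                     # negative phase: the network's fantasy
    h_fantasy, p_fantasy = sample_hidden(v_fantasy)
    # Hebbian increase for the data, matching decrease ("unlearning") for the fantasy
    W += 0.05 * (np.outer(v_data, p_data) - np.outer(v_fantasy, p_fantasy))
```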
wake-sleep algorithm
____________________________________
Geoffrey Hinton: The Foundations of Deep Learning
https://www.youtube.com/watch?v=zl99IZvW7rE
https://www.youtube.com/watch?v=zl99IZvW7rE
28:21
Elevate
Feb 7, 2018
it worked better, it worked just a little bit better
but good speech people, particularly down at Microsoft
realized right away that if this works a little bit better
and two graduate students did it in a few months
it's going to completely wipe out the existing state-of-the-art
and indeed over the next couple years, it did.
Google used it, and it suddenly got better than Siri
now all speech recognition is trained with back propagation
and neural nets
Error rates on the ImageNet-2012 competition
• 2017 deep neural nets • 3%
• 2015 deep neural nets (or people!) • 5%
• university of toronto (Krizhevsky et al, 2012) • 16%
• university of tokyo • 26%
• oxford university (zisserman et al) • 27%
• INRIA (French national research institute in CS)
+ XRCE (Xerox research center europe) • 27%
• university of amsterdam • 29%
recurrent neural networks
in medical images, very soon we will be better than radiologists
in skin cancer, we have a system that is comparable with dermatologists
____________________________________
• (2006) Hinton and colleagues' landmark paper: Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh, “A fast learning algorithm for deep belief nets”, Neural Computation 18 (2006)
• (2012) A neural network built by Hinton's team demolished the competition in an international computer vision contest.
([ if you listen to the following 18 minute video, you're going to hear about Hinton and his colleagues, the hardware (Nvidia), ImageNet (image dataset), and the fast learning algorithm (Hinton and colleagues) ])
https://en.wikipedia.org/wiki/ImageNet
([ so why am I mentioning this; until now, I had no idea that Nvidia was the hardware that Hinton and gang were using to do their machine learning ])
How Nvidia Won AI
https://www.youtube.com/watch?v=GuV-HyslPxk
https://www.youtube.com/watch?v=GuV-HyslPxk
Asianometry
234,119 views Feb 20, 2022
When we last left Nvidia, the company had emerged victorious in the brutal graphics card Battle Royale throughout the 1990s.
Very impressive. But as the company entered the 2000s, they embarked on a journey to do more. Moving towards an entirely new kind of microprocessor - and the multi-billion dollar market it would unlock.
In this video, we are going to look at how Nvidia turned the humble graphics card into a platform that dominates one of tech’s most important fields: Artificial Intelligence.
Links:
- The Asianometry Newsletter: https://asianometry.substack.com
- Patreon: https://www.patreon.com/Asianometry
- The Podcast: https://anchor.fm/asianometry
- Twitter: https://twitter.com/asianometry
____________________________________
IBM Watson
en.wikipedia.org
https://en.wikipedia.org/wiki/IBM_Watson
Siri
en.wikipedia.org
https://en.wikipedia.org/wiki/Siri
Nuance Communications
https://en.wikipedia.org/wiki/Nuance_Communications
speech recognition
https://en.wikipedia.org/wiki/Speech_recognition
Amazon Echo (Alexa)
en.wikipedia.org
https://en.wikipedia.org/wiki/Amazon_Echo
OpenAI (company), ChatGPT-4 (query-and-response text chatbot)
en.wikipedia.org
https://en.wikipedia.org/wiki/OpenAI
GitHub Copilot
https://en.wikipedia.org/wiki/GitHub_Copilot
GPT-3
Generative Pre-trained Transformer 3 (GPT-3) is a large language model released by OpenAI in 2020.
https://en.wikipedia.org/wiki/GPT-3
OpenAI Codex
https://en.wikipedia.org/wiki/OpenAI_Codex
CALO
"Cognitive Assistant that Learns and Organizes"
https://en.wikipedia.org/wiki/CALO
CUDA
CUDA (or Compute Unified Device Architecture) is a proprietary and closed source parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.
https://en.wikipedia.org/wiki/CUDA
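([ CUDA kernels are normally written in C/C++; below is a minimal Python sketch of the "compute kernel" idea using the Numba CUDA binding, assuming the numba package and a CUDA-capable GPU are available; my illustration, not Nvidia's sample code ])
```python
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)              # this thread's index across the whole launch grid
    if i < out.size:              # guard against threads beyond the array length
        out[i] = a[i] + b[i]

n = 1 << 20
a = np.ones(n, dtype=np.float32)
b = np.full(n, 2.0, dtype=np.float32)
out = np.zeros(n, dtype=np.float32)
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](a, b, out)   # many threads run the same kernel in parallel
print(out[:3])                                     # [3. 3. 3.]
```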
Stephen Witt., The chosen chip : how Nvidia is powering the A.I. revolution., The New Yorker., Dec. 4, 2023
p.35
In 2017, Google introduced a new architecture for neural-net training called the transformer. The following year, OpenAI used Google's framework to build the first “generative pre-trained transformer”, or G.P.T.
The G.P.T. models were trained on Nvidia supercomputers, absorbing an enormous corpus of text and learning how to make humanlike connections.
(The new yorker, Dec. 4, 2023, brave new world dept., The chosen chip : how nvidia is powering the A.I. revolution., By Stephen Witt., pp.28─37, )
____________________________________
put the text selection from Amazon Unbound on Amazon Echo here
(done - Thur 27 Jan 2022)
Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021
• The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent [ 3% ] increase in Alexa's accuracy., p.37, Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021.
p.23
The initiative was originally designated inside Lab126 as Project D. It would come to be known as the Amazon Echo, and by the name of its virtual assistant, Alexa.
p.24, p.45
Project D, also known as ‘Amazon Alexa’, later named ‘Amazon Echo’
January 4, 2011, first email from Bezos on Project D, p.24
November 6, 2014, product launch, p.45
([
within a four-year time horizon Amazon developed a voice-enabled user interface, inside a real─world working product,
─ developed far─field speech recognition
─ refined speech output (speak and sound like a natural voice)
─ back-office technical development
─ developed the plan to gather enough data for the far─field speech recognition
─ the heavy lifting of the speech recognition and other sensory data processing happens at the data center
─ needs an internetwork [Internet or VPN] connection with the data center
─ (( I would be interested to know: if you were to connect an Amazon Echo inside a corporate network and configure the device with a proxy server to communicate with the Amazon server, what else does the Echo need to connect to in order to work properly, and how would a corporate firewall react to this new traffic? ))
─ port number for Amazon Echo (Alexa)
─ for example, the port number for e─mail (SMTP) is 25
• The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent [3%] increase in Alexa's accuracy., p.37, Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021.
])
p.462 Index
Amazon Alexa, 26─38
AMPED and, 43─44
beta testers
Bezos's sketch for,
bug in,
as Doppler project, 26─38, 40, 42─47
Evi and, 34─36
Fire tablet and, 44
language─specific version of, 60
launch of, 44─46
name of, 32
Skills Kit, 44─46
social cue recognition in, 34─35
speech recognition in,
voice of, 27─30
voice service, 47
see also Amazon Echo
far─field speech recognition, 27─28
p.24
Greg Hart
([ in 2010, Greg Hart pointed out to Jeff Bezos that speech recognition technology was getting good at dictation and search; he did this by showing Jeff Google's voice search on an Android phone ])
speech recognition 2010
Google's voice search, Android phone
technology was finally getting good at dictation and search
p.24
Hart remembered talking to Bezos about speech recognition one day in late 2010 at Seattle's Blue Moon Burgers. Over lunch, Hart demonstrated his enthusiasm for Google's voice search on his Android phone by saying, “pizza near me”, and then showing Bezos the list of links to nearby pizza joints that popped up on-screen. “Jeff was a little skeptical about the use of it on phones, because he thought it might be socially awkward”, Hart remembered. But they discussed how the technology was finally getting good at dictation and search.
p.24
January 4, 2011
Greg Hart,
Ian Freed, device vice president,
Steve Kessel
Amazon's HQ, Day 1 North building
p.25
voice-activated cloud computer
speaker, microphone, a mute button
Fiona, the Kindle building
p.26
One early recruit, Al Lindsay,
Al Lindsay, who in a previous job had written some of the original code for telco US West's voice-activated directory assistance. Lindsay spent his first three weeks on the project on vacation at his cottage in Canada, writing a six-page narrative that envisioned how outside developers might program their own voice-enabled apps that could run on the device.
p.26
internal recruit,
John Thimsen, director of engineering
p.26
To speed up development
Hart and his crew started looking for startups to acquire.
p.27
Yap, a twenty-person startup based in Charlotte, North Carolina, automatically translated human speech such as voicemails into text, without relying on a secret workforce of human transcribers
p.27
though much of Yap's technology would be discarded, its engineers would help develop the technology to convert what customers said into a computer-readable format.
p.27
industry conference in Florence, Italy
Amazon's newfound interest in speech technology
p.27
Jeff Adams, Yap's VP of research
two-decade veteran of the speech industry
pp.27-28
after the meeting, Adams delicately told Hart and Lindsay that their goals were unrealistic. Most experts believed that true “far-field speech recognition” ── comprehending speech from up to 32 feet away, often amid crosstalk and background noise ── was beyond the realm of established computer science, since sound bounces off surfaces like walls and ceilings, producing echoes that confuse computers.
“They basically told me, ‘We don't care. Hire more people. Take as long as it takes. Solve the problem,’” recalled Adams. “They were unflappable.”
p.28
Polish startup Ivona generated computer-synthesized speech that resembled a human voice.
Ivona was founded in 2001 by Lukasz Osowski, a computer science student at the Gdansk University of Technology. Osowski had the notion that so-called “text-to-speech”, or TTS, could read digital texts aloud in a natural voice and help the visually impaired in Poland appreciate the written word.
Michael Kaszczuk
he took recordings of an actor's voice and selected fragments of words, called diphones, and then blended or “concatenated” them together in different combinations to approximate natural-sounding words and sentences that the actor might never have uttered.
p.28
While students, they paid a popular Polish actor named Jacek Labijak to record hours of speech to create a database of sounds. The result was their first product, Spiker, which quickly became the top-selling computer voice in Poland.
Over the next few years, it was used widely in subways, elevators, and for robocall campaigns.
p.29
annual Blizzard Challenge, a competition for the most natural computer voice, organized by Carnegie Mellon University.
p.29
The Gdansk R&D center was put in charge of crafting Doppler's voice.
p.29
the team considered lists of characteristics they wanted in a single personality, such as trustworthiness, empathy, and warmth, and determined those traits were more commonly associated with a female voice.
pp.29-30
Atlanta-area voice-over studio GM Voices, the same outfit that had helped turn recordings from a voice actress named Susan Bennett into Apple's agent, Siri.
p.30
To create synthetic personalities, GM Voices gave female voice actors hundreds of hours of text to read, from entire books to random articles, a mind-numbing process that could stretch on for months.
p.30
voice artist behind Alexa
professional voice-over community: Boulder-based singer and voice actress Nina Rolle.
warm timbre of Alexa's voice
Nina Rolle (Boulder-based singer and voice actress)
p.32
Bezos also suggested “Alexa”, an homage to the ancient library of Alexandria, regarded as the capital of knowledge.
p.32
[ seven omnidirectional microphones ] at the top
a cylinder elongated to create separation between the array of seven omnidirectional microphones at the top and the speakers at the bottom, with some 14 hundred holes punctured in the metal tubing to push out air and sound.
p.34
In 2012, inspired by Siri's debut, Tunstall-Pedoe pivoted and introduced the Evi app for the Apple and Android app stores. Users could ask it questions by typing or speaking. Instead of searching the web for answers, like Siri, or returning a set of links, like Google's voice search, Evi evaluated the question and tried to offer an immediate answer. The app was downloaded over 250,000 times in its first week and almost crashed the company's servers.
p.34
Evi employed a programming technique called knowledge graphs, or large databases of ontologies, which connect concepts and categories in related domains. If, for example, a user asked Evi, “What is the population of Cleveland?” the software interpreted that question and knew to turn to an accompanying source of demographic data. Wired described the technique as a “giant treelike structure” of logical connections to useful facts.
Putting Evi's knowledge base inside Alexa helped with the kind of informal but culturally common chitchat called phatic speech.
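([ a toy “knowledge graph” lookup in Python, in the spirit of the Evi description; my illustration only, with made-up entities and numbers rather than Evi's actual data or methods ])
```python
# a tiny ontology: entities connected to typed facts
knowledge_graph = {
    "Cleveland": {"type": "city", "population": 372624, "state": "Ohio"},
    "Ohio": {"type": "state", "capital": "Columbus"},
}

def answer(question):
    q = question.lower()
    # crude interpretation: spot a known entity and a known attribute in the question,
    # then answer directly instead of returning a list of links
    for entity, facts in knowledge_graph.items():
        if entity.lower() in q:
            for attribute, value in facts.items():
                if attribute in q:
                    return f"The {attribute} of {entity} is {value}."
    return "I don't know."

print(answer("What is the population of Cleveland?"))
print(answer("What is the capital of Ohio?"))
```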
p.35
Integrating Evi's technology helped Alexa respond to factual queries, such as requests to name the planets in the solar system, and it gave the impression that Alexa was smart. But was it? Proponents of another method of natural language understanding, called deep learning, believed that Evi's knowledge graphs wouldn't give Alexa the kind of authentic intelligence that would satisfy Bezos's dream of a versatile assistant that could talk to users and answer any question.
p.35
In the deep learning method, machines were fed large amounts of data about how people converse and what responses proved satisfying, and then were programmed to train themselves to predict the best answers.
p.35
The chief proponent of this approach was an Indian-born engineer named Rohit Prasad. “He was a critical hire”, said engineering director John Thimsen. “Much of the success of the project is due to the team he assembled and the research they did on far-field speech recognition.”
p.35
BBN Technologies (later acquired by Raytheon)
Cambridge, Massachusetts-based defense contractor
At BBN, he [Rohit Prasad] worked on one of the first in-car speech recognition systems and automated directory assistance services for telephone companies.
p.37
For years, Google also collected speech data from a toll-free directory assistance line, 800-GOOG-411.
p.37
Hart, Prasad, and their team created graphs that projected how Alexa would improve as data collection progressed. The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent increase in Alexa's accuracy.
• The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent increase in Alexa's accuracy., p.37, Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021.
p.37
“How will we even know when this product is good?”
early 2013
Hart, Prasad, and their team created graphs that projected how Alexa would improve as data collection progressed. The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent [3%] increase in Alexa's accuracy.
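([ a back-of-envelope reading of that claim, in Python; my arithmetic, not Amazon's: if each successive 3 percent of accuracy costs roughly a doubling of the data, the data requirement grows exponentially while accuracy improves only linearly ])
```python
def data_multiplier(accuracy_gain_points, gain_per_doubling=3.0):
    # doublings needed = gain / 3, so data grows as 2 ** (gain / 3)
    return 2 ** (accuracy_gain_points / gain_per_doubling)

for gain in (3, 6, 9, 15, 30):
    print(f"+{gain:>2} points of accuracy -> about {data_multiplier(gain):,.0f}x the data")
# +30 points would take on the order of a thousand times the original data
```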
p.38
“First tell me what would be a magical product, then tell me how to get there.”
p.38
Bezos's technical advisor at the time, Dilip Kumar,
p.38
they would need thousands more hours of complex, far-field voice commands.
p.38
Bezos apparently factored in the request to increase the number of speech scientists and did the calculation in his head in a few seconds.
“Let me get this straight. You are telling me that for your big request to make this product successful, instead of it taking forty years, it will only take us twenty?”
p.42
the resulting program, conceived by Rohit Prasad and speech scientist Janet Slifka over a few days in the spring of 2013
p.42
Rohit Prasad and speech scientist Janet Slifka
spring of 2013
p.42
answer a question that later vexed speech experts ──
how did Amazon come out of nowhere to leapfrog Google and Apple in the race to build a speech-enabled virtual assistant?
pp.42-43
internally the program was called AMPED
Amazon contracted with an Australian data collection firm, Appen, and went on the road with Alexa, in disguise.
p.43
Appen rented homes and apartments, initially in Boston, and then Amazon littered several rooms with all kinds of “decoy” devices: pedestal microphones, Xbox gaming consoles, televisions, and tablets. There were also some twenty Alexa devices planted around the rooms at different heights, each shrouded in an acoustic fabric that hid them from view but allowed sound to pass through.
p.43
Appen then contracted with a temp agency, and a stream of contract workers filtered through the properties, eight hours a day, six days a week, reading scripts from an iPad with canned lines and open-ended requests
p.43
The speakers were turned off, so that Alexa didn't make a peep, but the seven microphones on each device captured everything and streamed the audio to Amazon's servers. Then another army of workers manually reviewed the recordings and annotated the transcripts, classifying queries that might stump a machine,
p.43
so that next time, Alexa would know.
p.43
The Boston test showed promise, so Amazon expanded the program, renting more homes and apartments in Seattle and ten other cities over the next six months to capture the voices and speech patterns of thousands more paid volunteers. It was a mushroom-cloud explosion of data about device placement, acoustic environments, background noise, regional accents, and all the gloriously random ways a human being might phrase a simple request to hear the weather, for example, or play a Justin
p.44
by 2012
multimillion-dollar cost.
p.44
By 2014, it had increased its store of speech data by a factor of ten thousand and largely closed the gap with rivals like Apple and Google.
p.47
over the next few months, Amazon would roll out the Alexa Skills Kit, which allowed other companies to build voice-enabled apps for the Echo, and Alexa Voice Service, which let the makers of products like lightbulbs and alarm clocks integrate Alexa into their own devices.
p.47
a smaller, cheaper version of Echo, the hockey puck-sized Echo Dot,
a portable version with batteries, the Amazon Tap.
Echo
Echo dot
Amazon Tap (a portable, battery-powered version of Echo)
p.24
January 4, 2011
p.45
November 6, 2014
Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021
____________________________________
artificial intelligence
DARPA program
CALO (Cognitive Assistant that Learns and Organizes)
CALO, years later, helped inspire the creation of Siri.
CALO (Cognitive Assistant that Learns and Organizes)
Siri
Google's voice search
Amazon Echo
Henry Kressel, If you really want to change the world, 2015 [ ]
p.12
SRI (formerly the Stanford Research Institute), one of the world's largest independent research institutes
SRI won a project under a DARPA program and called it CALO (Cognitive Assistant that Learns and Organizes). CALO, years later, helped inspire the creation of Siri.
p.12
CALO developed into a massive program under the leadership of Bill Mark; Ray Perrault, director of the artificial intelligence center; Adam Cheyer, David Israel, Karen Myers, and Tom Garvey, program directors in the artificial intelligence center; Tom Dietterich, professor at Oregon State University; and many others. DARPA funded the program from 2003 to 2009, and it included the participation of more than 23 universities (including Stanford University, Carnegie Mellon, UC Berkeley, and MIT) and labs from the who's who of the artificial intelligence world. At more than $180 million, CALO was the largest artificial intelligence program in the history of DARPA. Concepts from the CALO program contributed to the basis of Siri and subsequent ventures.
p.15
Siri would be a “do engine” ...
Siri would allow people to buy tickets, make reservations, get the weather report, and find a movie by speaking into a smartphone. Siri would give them answers, no links.
p.67
Adam Cheyer, VP of engineering at Siri, throughout his career kept a list of the top five people in various technological fields. In meetings, Adam would talk about his recruiting progress with statements like, “I've got three of the top five people in this field. I'm going after the other two this month.” As a result of having this top talent, the Siri team exceeded goals and expectations at every stage.
(Kressel, Henry, If you really want to change the world : guide to creating, building, and sustaining breakthrough ventures / Henry Kressel, Norman Winarsky., 1. New business enterprises., 2. Venture capital., 3. Entrepreneurship., 2015, 658.11 Kressel, )
____________________________________
• “expectation-maximization algorithm”, Leonard Baum
• computerized translation of foreign language [text] [or scripture]
• idea of “statistical machine translation”
• Canada's parliamentary records, which contain thousands of pages of paired passages in French and English
• Canadian Hansard
• database of parliamentary speeches
• Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, “The mathematics of statistical machine translation: parameter estimation”, Computational Linguistics 19, no. 2 (1993).
• Andy Way, “A critique of statistical machine translation”, in W. Daelemans and V. Hoste (eds.), Journal of Translation and Interpreting Studies: Special Issue on Evaluation of Translation Technology, Linguistica Antverpiensia, 2009, pp.17-41.
Sebastian Mallaby., More money than god : hedge funds and the making of a new elite, 2010.
pp.298-301
p.298
in 1993
Peter Brown and Robert Mercer.
They came from IBM's research center,
Before arriving at Renaissance, Brown and Mercer had worked a little on cryptography, but their real achievement lay elsewhere.
They had upended a related field ── that of computerized translation.
p.298
on translation, the subject was dominated by programmers who actually spoke some foreign languages. The approach was to understand the language from the inside, to know its grammar and its syntax, and to teach the computer that “la fille” means “the girl” and “les filles” is the plural form, much as you might teach a middle schooler.
p.299
But Brown and Mercer had a different method. They did not speak French, and they were not about to wade into its syntax or grammar. Instead, they got hold of Canada's parliamentary records, which contain thousands of pages of paired passages in French and English. Then they fed the material into an IBM workstation and told it to figure out the correlations.
p.299
their experiment at IBM was written up and published.21 It began with some scrubbing of data: Just as financial-market price histories must be checked for “bad ticks” ── places where a sale is reported at $16 instead of $61 ── so the Canadian Hansard contained misprinted words that might confuse a translation program. Next, the computer began to search the data for patterns.
p.299
For all it knew at the outset, a given English word [and common English phrasings] was equally likely to be translatable into any of the 58,000 French words [and common French phrasings] in the sample, but once the computer had checked through the twinned passages, it found that most English words appeared in only some: Immediately, nearly 99 percent of the uncertainty was eliminated. Then the computer proceeded with a series of more subtle tests; for example, it assumed that an English word was most likely to correspond to a French word that came in the same position in the sentence. By now some word pairs were starting to appear: Couplings such as lait/milk and pourquoi/why shouted from the data. But other correlations spoke in a softer voice.
p.299
To hear them clearly, you had to comb the data multiple times, using a slightly different algorithm at each turn. “Only in this way can one hope to hear the quiet call of marqué d'un asterisque/starred or the whisper of qui s'est fait bousculer/embattled”, Brown and Mercer reported.
p.299
To the code breakers at the Institute for Defense Analyses, this method would not have seemed surprising.22
“expectation-maximization algorithm”, Leonard Baum
p.299
Indeed, Brown and Mercer used a tool called the “expectation-maximization algorithm”, and they cited its inventor Leonard Baum ── who had worked for IDA [Institute for Defense Analyses] and then later for Simons.23
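([ a miniature Python version of the statistical idea described here: IBM Model 1 style expectation-maximization over paired sentences; my toy sketch, with two sentence pairs standing in for the Canadian Hansard ])
```python
from collections import defaultdict

pairs = [
    (["the", "girl", "drinks", "milk"], ["la", "fille", "boit", "du", "lait"]),
    (["the", "girls", "drink", "milk"], ["les", "filles", "boivent", "du", "lait"]),
]

t = defaultdict(lambda: 0.1)        # t[(f, e)]: probability English word e translates to French word f
for _ in range(20):
    counts = defaultdict(float)
    totals = defaultdict(float)
    for english, french in pairs:
        for f in french:
            norm = sum(t[(f, e)] for e in english)
            for e in english:
                frac = t[(f, e)] / norm            # E-step: expected alignment counts
                counts[(f, e)] += frac
                totals[e] += frac
    for (f, e), c in counts.items():               # M-step: re-estimate the probabilities
        t[(f, e)] = c / totals[e]

# after a few passes, couplings like lait/milk pull ahead of unrelated words, from the data alone
print(sorted(((round(p, 2), f) for (f, e), p in t.items() if e == "milk"), reverse=True)[:3])
```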
p.299
But although the idea of “statistical machine translation” seemed natural to the code breakers, it was greeted with outrage by traditional programmers. A reviewer of the Brown-Mercer paper scolded that “the crude force of computers is not science”,
p.300
and when the paper was presented at a meeting of translation experts, a listener recalled, “We were all flabbergasted .... People were shaking their heads and spurting grunts of disbelief or even of hostility.”
“Where's the linguistic intuition?” the audience wanted to know ── to which the answer seemed to be, “Yes that's the point; there isn't any”.
Fred Jelinek, the IBM manager who oversaw Brown and Mercer, poured salt into the wounds. “Every time I fire a linguist, my system's performance improves”, he told the naysayers.24
p.300
By the time Brown and Mercer joined Renaissance in 1993, the skeptics were capitulating. Once the IBM team's program had figured out the sample passages from the Canadian Hansard, it could translate other material too: If you presented it with an article in a French newspaper, it would zip through its database of parliamentary speeches, matching the article's phrases with the decoded material. The results outclassed competing translation systems by a wide margin, and within a few years the advent of statistical machine translation was celebrated among computer scientists as something of an intellectual revolution.25
p.300
Canadian political rhetoric had proved more useful than suspected hitherto. And Brown and Mercer had reminded the world of a lesson about artificial intelligence.
The lesson concerned the difference between human beings and computers.
p.300
The early translation programs had tried to teach computers vocabulary and grammar because that's how people learn things.
p.300
But computers are better suited to a different approach: They can learn to translate between English and French without paying much attention to the rules of either language. Computers don't need to understand verb declensions or adjectival inflections before they approach a pile of political speeches; they prefer to get the speeches first, then penetrate their code by combing through them algorithmically.
p.300
Likewise, computers have no trouble committing millions of sentences to memory; they can learn languages in chunks, without the crutch of grammatical rules that human students use to prompt their memories.
pp.300-301
For example, a computer can remember the English translations for phrases such as “la fille est intelligente, les filles sont intelligentes”, and a dozen other variations besides; it does not necessarily need to understand that “fille” is the singular form of “filles”, that “est” and “sont” are different forms of the verb “être”, and so on.26
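A tiny illustration of translating by memorized chunks rather than grammatical rules: a lookup table of remembered phrase pairs. The entries are the passage's own examples; a real system would hold millions of such chunks and stitch partial matches together.

# Toy illustration of "learning in chunks": a memorized phrase table, no grammar.
phrase_table = {
    "la fille est intelligente": "the girl is intelligent",
    "les filles sont intelligentes": "the girls are intelligent",
}

def translate(sentence):
    # Exact lookup of a remembered chunk; unknown input is flagged, not parsed.
    return phrase_table.get(sentence.lower(), f"<unknown: {sentence}>")

print(translate("La fille est intelligente"))   # -> the girl is intelligent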
p.301
Contrary to the harrumphing of the IBM team's critics, the crude force of a computer's memory can actually substitute for human notions of intelligence and science. And computers are likely to work best when they don't attempt to reach results in the way that humans would do.
p.301
Brown and Mercer fed the data into the computer first and let it come up with the answers.
p.453
21. See, for example, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, “The mathematics of statistical machine translation: parameter estimation”, computational linguistics 19, no. 2 (1993). As noted below, the Della Pietra brothers followed Brown and Mercer from IBM to Renaissance Technologies.
p.454
22. As far back as 1949, code breakers had wondered about the application of their technique to translation. But they lacked computing power; statistical translation depended on feeding a vast number of pairs of sentences into a computer, so that the computer had enough data from which to extract meaningful patterns. But by around 1990, statistical translation was possible on a well-equipped workstation.
23.
24. An account of the reaction to the Brown-Mercer work is given in Andy Way “A critique of statistical machine translation”. In W. Daelemans and V. Hoste (eds.), Journal of translation and interpreting studies: special issue on evaluation of translation technology, Linguistica antverpiensia, 2009, pp.17-41.
25. See, for example, Pius Ten Hacken, “Has there been a revolution in machine translation?” Machine Translation 16, no. 1 (March 2001): pp. 1-19.
26. The initial version of the IBM program included no linguistic rules at all. Later versions did use some, but they played a far smaller role than in the traditional translation programs.
p.454
29.
explicitly presented their experience with statistical machine translation as relevant to finding order in other types of data, including financial data. See Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra, “A maximum entropy approach to natural language processing”, computational linguistics 22, no. 1 (March 1996): pp.39-71.
(More money than god : hedge funds and the making of a new elite / Sebastian Mallaby., 1. hedge funds., 2. investment advisors., HG4530.M249 2010, 332.64'524──dc22, 2010, )
____________________________________
____________________________________
• "Data Science", which is the automatic (or semi-automatic) extraction of knowledge from data.;── Yann LeCun (self.MachineLearning).
• the goal of extracting information from data.;── Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman.
• ... and thus discover something about data that will be seen in the future.;── Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman.
• All algorithms for analysis of data are designed to produce a useful summary of the data, from which decisions are made.;── Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of massive datasets, 2010; http://infolab.stanford.edu/~ullman/mmds/book.pdf
http://infolab.stanford.edu/~ullman/mmds/book.pdf
(( last checked: Thur May 13, 2021 [up] ))
____________________________________
A Artificial intelligence - machine learning - unsupervised machine learning
B Big data - data science
C Cloud - cloud computing - warehouse computing - data center
- Amazon Cloud, same as Amazon AWS (Amazon Web Services)
- Microsoft Azure, and others
- software as a service
- APIs as a service
ABC
A AI - artificial intelligence (Machine learning,
Data mining, Deep Learning)
this neural network machine deep learning
https://en.wikipedia.org/wiki/Deep_learning
not this educational meaning of deep learning
https://en.wikipedia.org/wiki/Deeper_learning
B Big data (Data science)
https://en.wikipedia.org/wiki/Big_data
https://en.wikipedia.org/wiki/Data_science
http://phys.org/news/2015-10-human-intuition-algorithms-outperforms-teams.html
C Cloud computing - Amazon Web Services (AWS),
Microsoft Azure,
IBM cloud,
Google Cloud Platform (GCP),
____________________________________
•─ aerospace, communications, and electronics (ACE) sectors,
Malcolm Harris, Palo alto : a history of california, capitalism, and the world
by Malcolm Harris, 2023
klystron, 99, 189─92, 194, 223, 247, 253─54, 255
p.223
aerospace, communications, and electronics (ACE) sectors,
specific exceptional competencies in growing ACE subfields that made Stanford an irresistible lure for federal and private research funds.
Varian klystron (which remained a source of passive income for the university),
to found the (not “a” or “Stanford”, “the”) Microwave Lab.
That meant the government paid for new and expensive building-size research machines, including particle accelerators, nuclear reactors, and computers.6
6. Audra J. Wolfe, Competing with the soviets: science, technology, and the state in cold war america, 2013, 42.
(Palo alto : a history of california, capitalism, and the world
by Malcolm Harris, 2023)
____________________________________
Barry Boehm oral history
Computer history museum
Oral history of Barry Boehm, part 2 of 2
interviewed by:
David C. Brock
Lee Osterweil
recorded February 20, 2018
TRW
COCOMO [constructive cost model] model to estimate whether we would be able to improve productivity.
biggest thing outside of defense is auto parts,
used the COCOMO model to say, "If your tools are better than this, if you educated your people in these technologies, you ought to be able to double productivity in 10 years."
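For reference, the basic COCOMO effort equation this kind of estimate rests on: effort in person-months equals a times KLOC to the power b, with coefficients depending on the project class. The coefficients below are the published basic-model values; the 32-KLOC example size is hypothetical.

# Basic COCOMO (Boehm): effort in person-months = a * (KLOC ** b).
# Coefficients are the published basic-model values for the three project
# classes; the 32-KLOC example below is a hypothetical project size.
COEFFS = {"organic": (2.4, 1.05), "semi-detached": (3.0, 1.12), "embedded": (3.6, 1.20)}

def cocomo_effort(kloc, mode="organic"):
    a, b = COEFFS[mode]
    return a * kloc ** b

print(round(cocomo_effort(32, "semi-detached"), 1), "person-months")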
automated test case generation and things like that.
re-baselined everybody
Looking at things, we found that most of the time people were producing documents and filling out forms rather than writing computer programs.
So, we made sure that the secretaries would get on this.
DARPA
software engineering
project called Arcadia
Lee [Osterweil]
Dick [Richard] Taylor
a carpenter has a whole toolkit and which tool to use at what point in order to build a house. And a software engineer has all these tools lying around all over the place, and they all do something different, and you, you know, you ought to have a nice box for them. The tools all belong to the right place and the people know what tool to use at what time and in what way. We called it an environment.
the holder, it was the framework for integrating the tools.
design tools, requirements tools, code tools, test tools
IBM, Toronto
Univac, Minneapolis
Boeing, a strategic partnership with Digital Equipment, DEC.
prioritizing
high priority things are the things that you want to test first, and the things you want to inspect first.
CMM [capability maturity model] with Watts Humphrey.
configuration management
requirement management
test management
verification and validation
TRW
creating a new satellite system.
propulsion people
structures people
guidance and control people
communications people
architectures
defines what the system was going to be like.
after a while the software was really driving the systems
Winston Royce
1970
wrote a definitive paper
it did say that you really want to do some building it twice, so that you know roughly the directions you want to go.
statistical decision theory
prototyping is a form of risk reduction,
Rather than doing a sequence of specifications, you want to be doing a combination of specifications and prototypes.
[International Software] process workshops
first one was in England
we hosted the second one in California
Watts Humphrey.
about determining predictability.
He was not particularly interested in how you build software, just whether your projections about cost and budget, scheduling and budget, could be trusted or not.
motivation for the CMM was just so that the Defense Department knew which of those lying contractors could be believed, and which ones could not be believed.
there are combinations of agile things that you want to do, and plan-driven things that you want to do.
Rich [Richard] Turner
book, Balancing Agility and Discipline.
there are things where lockstep discipline is not a very good thing to do, but there are places where coordinating what you're doing is a good thing to do.
At one point at TRW, I was on a panel that was saying, “What were the causes of so many missiles getting launched from Vandenberg and then blowing up because of the software?” In most cases, it was because people are responding to change over following a plan and saying, "We've got this fix that, we've got to do, or we've got this telemetry station that's moved and we've got to put a patch in the software. There's not enough time to do the regression testing and the configuration management and following the plan." And so, launched the rocket and boom, there it goes. Responding to change over following a plan may not be good in some situations.
in fact for things that really matter, if your whole bank is going to rest on this or your entire business is going to stand or fall depending on whether this thing works correctly or not, people don't tend to use agile methods.
So I think when things are important people really do fall back on plans and they want to know what they have, and they want to be sure they can trust it.
1958
Hubert Dreyfus, who wrote the book, What computers can't do,
and showed all of the failed predictions that said,
"In 1958, in 10 years, the computer will be the world's chess champion." Well, they got there but not in ten years.
I was getting sort of a balance of skepticism and enthusiasm.
Minuteman command and control system
Montana, North Dakota, and various places.
"Your job as a manager is to manage expectations. Never let people's expectations get out of the box, because if they get out of the box you can never win. If you do those incredible things people will say, 'WEll sure', but most likely you're not going to be able to do those things and people are going to get mad."
so the AI people have made the mistake of over-promising.
in 1955, "pretty soon a computer's going to be the world's best chess player. Computers are going to automatically translate any language into any other language faster than anybody can even think the words."
My own personal belief is that just as every AI boom is bigger than the previous one, every AI bust will be bigger than the previous one, too.
One of my program managers was an Air Force major when I got there, and he got promoted to lieutenant colonel, but he introduced himself as saying, "I am the major cross that you're going to have to bear."
Boehm: He's now the number two guy at Georgia Tech, and he was the director of the Software Engineering Institute and had a really outstanding career. He got some CMU [Carnegie Mellon University] and MIT people to come up with an AI constraint-based planning approach to solve transportation problems. This was in 1990 and '91, and they came up with a system that could do in four hours using constraint-based planning what it was taking the clunky transportation command software four days to do. Just about in 1991, we needed to get a half a million people off to the Middle East to fight the first Iraq war. And the transportation command said, “We're confiscating your Sun computers because we need your system to plan all of these things.”
Brock: Wow.
Boehm: They replaced them eventually. <laughs> But fundamentally, this was a key to getting all that stuff there really fast, and a triumph for AI. Steve Cross got the Golden Nugget Award from the commander of the Air Force and went on from there. So, yes, there were enough examples like that that you can make a case that AI was something that was really going to help.
a lot of organizations didn't want money added to their budget that they didn't control.
"I get my money from the Chief of Naval Operations and I follow his priorities, and our big priority right now is corrosion. Our boats are getting corroded and we need more research in corrosion technology. And software, I can't really accept your software money. If I get more money I'm going to use it for corrosion."
Dean Leffingwell
Chalmers university in Sweden, Jan Bosch,
T-shape people
Software Management and Economics course, say, “You are the CTO [chief technical officer] of a 500-person software company and your chief executive officer is concerned about AI, or DevOps, or Artificial Intelligence of various kinds and the like. What you need to do is to give him an analysis of how mature are these, and what are their strengths and what are their weaknesses, and what would we have to do to address these. And that you're going to get graded on how incisive your analysis is, plus the number of different ways that you learn about things. So you can't just go to Google and stop there. You should try to interview some people that are in companies, or are developing this kind of research and things like that. You should look at the proceedings of conferences and the ACM [Association for Computing Machinery]/IEEE [Institute of Electrical and Electronics Engineers] literature, and so the more sources that you do the better your grade is going to be.”
International Conference on Software Engineering (ICSE)
source:
Computer history museum
Oral history of Barry Boehm, part 2 of 2
interviewed by:
David C. Brock
Lee Osterweil
recorded February 20, 2018
____________________________________
Charles Duhigg., The optimists : the full story of microsoft's relationship with OpenAI., The new yorker, Dec. 11, 2023
p.33
One day in 2019, an OpenAI vice-president named Dario Amodei demonstrated something remarkable to his peers: he inputted part of a software program into GPT and asked the system to finish coding it. It did so almost immediately (using techniques that Amodei hadn't planned to employ himself). Nobody could say exactly how the A.I. had pulled this off ── a large language model is basically a black box. GPT has relatively few lines of actual code; its answers are based, word by word, on billions of mathematical “weights” that determine what should be outputted next, according to complex probabilities. It's impossible to map out all the connections that the model makes while answering users' questions.
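A toy sketch of the word-by-word generation loop the passage describes. This is not how GPT is implemented; a real model computes next-word probabilities from billions of learned weights, whereas the lookup table below is invented, but it makes the "pick the next token according to probabilities" step concrete.

import random

# Toy sketch of next-token generation "word by word, according to probabilities".
# The bigram probabilities below are invented; a real LLM derives them from
# billions of learned weights rather than a hand-written lookup table.
next_word_probs = {
    "def":    {"add": 0.6, "main": 0.4},
    "add":    {"(a,": 1.0},
    "(a,":    {"b):": 1.0},
    "b):":    {"return": 1.0},
    "return": {"a+b": 0.9, "a-b": 0.1},
}

def generate(start, steps=6):
    out = [start]
    for _ in range(steps):
        dist = next_word_probs.get(out[-1])
        if not dist:
            break
        words, probs = zip(*dist.items())
        out.append(random.choices(words, probs)[0])  # sample the next token
    return " ".join(out)

print(generate("def"))   # e.g. "def add (a, b): return a+b"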
For some within OpenAI, GPT's mystifying ability to code was frightening ── after all, this was the setup of dystopian movies such as “The Terminator”. It was almost heartening when employees noticed that GPT, for all its prowess, sometimes made coding gaffes. Scott and Murati felt some anxiety upon learning about GPT's programming capabilities, but mainly they were thrilled. They'd been looking for a practical application of A.I. that people might actually pay to use ── if, that is, they could find someone within Microsoft willing to sell it.
Five years ago, Microsoft acquired GitHub ── a Web site where users shared code and collaborated on software ── for much the same reason that it invested in OpenAI. GitHub's culture was young and fast-moving, unbound by tradition and orthodoxy. After it was purchased, it was made an independent division within Microsoft, with its own C.E.O. and decision-making authority, in the hope that its startup energy would not be diluted. The strategy proved successful. GitHub remained quirky and beloved by software engineers, and its number of users grew to more than a hundred million.
So Scott and Murati, looking for a Microsoft division that might be excited by a tool capable of autocompleting code ── even if it occasionally got things wrong ── turned to GitHub's C.E.O. Nat Friedman. After all, code posted on GitHub sometimes contained errors; users had learned to work around imperfection. Friedman said that he wanted the tool. GitHub, he noted, just had to figure out a way to signal to people that they couldn't trust the autocompleter completely.
GitHub employees brainstormed names for the product: Coding autopilot, Automated pair programmer, programarama automat. Friedman was an amateur pilot, and he and others felt these names wrongly implied that the tool would do all the work. The tool was more like a co-pilot ── someone who joins you in the cockpit and makes suggestions, while occasionally proposing something off base. Usually you listen to a co-pilot; sometimes you ignore him. When Scott heard Friedman's favored choice for a name ── GitHub Copilot ── he loved it. “It perfectly conveys its strengths and weaknesses.”
But when GitHub prepared to launch its Copilot, in 2021, some executives in other Microsoft divisions protested that, because the tool occasionally produced errors, it would damage Microsoft's reputation. “It was a huge fight”, Friedman told me. “But I was the C.E.O. of GitHub, and I knew this was a great product, so I overrode everyone and shipped it.” When GitHub Copilot was released, it was an immediate success. “Copilot LITERALLY BLEW MY MIND”, one user tweeted hours after it was released. “IT'S WITCHCRAFT!!!” another posted. Microsoft began charging ten dollars per month for the app; within a year, annual revenue had topped a hundred million dollars. The division's independence had paid off.
(The new yorker, Dec. 11, 2023, The optimists : the full story of microsoft's relationship with OpenAI., By Charles Duhigg., p.33, )
____________________________________
• data science competitions
• "Data Science", which is the automatic (or semi-automatic) extraction of knowledge from data.;── Yann LeCun (self.MachineLearning).
• the goal of extracting information from data.;── Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman.
• ... and thus discover something about data that will be seen in the future.;── Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman.
• All algorithms for analysis of data are designed to produce a useful summary of the data, from which decisions are made.;── Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman, Mining of massive datasets, 2010; http://infolab.stanford.edu/~ullman/mmds/book.pdf
• to find predictive patterns in unfamiliar data sets
• a system that not only searches for patterns but designs the feature set [that the pattern is composed of], too.
• But where the teams of humans typically labored over their prediction algorithms for months, the Data Science Machine took somewhere between two and 12 hours to produce each of its entries - predictive patterns in unfamiliar data sets.
• "We view the Data Science Machine as a natural complement to human intelligence," says Max Kanter, whose MIT master's thesis in computer science is the basis of the Data Science Machine.
• feature engineering - identify what variables to extract from the database or compose
• MIT's online-learning platform (MITx) doesn't record either of those statistics, but it does collect data from which [the two crucial indicators] can be inferred.
• data marker
...even if a specific data marker is not included in the data set, it may be included by proxy in a combination of other, relevant data
• Once [the MIT's "Data Science Machine"] produced an array of candidates, ["Data Science Machine" algorithms] reduces their number by identifying those whose values seem to be correlated. Then [the algorithms] starts testing its reduced set of features on sample data, recombining them in different ways to optimize the accuracy of the predictions [the reduced set of features] yield.
• "The Data Science Machine is one of those unbelievable projects where applying cutting-edge research to solve practical problems opens an entirely new way of looking at the problem," says Margo Seltzer, a professor of computer science at Harvard University who was not involved in the work. "I think what they've done is going to become the standard quickly—very quickly."
• October 16, 2015 by Larry Hardesty
• System that replaces human intuition with algorithms outperforms human teams
• http://phys.org/news/2015-10-human-intuition-algorithms-outperforms-teams.html
____________________________________
• October 16, 2015 by Larry Hardesty
• System that replaces human intuition with algorithms outperforms human teams
• http://phys.org/news/2015-10-human-intuition-algorithms-outperforms-teams.html
•
•
MIT researchers aim to take the human element out of big-data analysis, with a new system that not only searches for patterns but designs the feature set, too. To test the first prototype of their system, they enrolled it in three data science competitions, in which it competed against human teams to find predictive patterns in unfamiliar data sets. Of the 906 teams participating in the three competitions, the researchers' "Data Science Machine" finished ahead of 615 of them.
In two of the three competitions, the predictions made by the Data Science Machine were 94 percent and 96 percent as accurate as the winning submissions. In the third, the figure was a more modest 87 percent. But where the teams of humans typically labored over their prediction algorithms for months, the Data Science Machine took somewhere between two and 12 hours to produce each of its entries.
"We view the Data Science Machine as a natural complement to human intelligence," says Max Kanter, whose MIT master's thesis in computer science is the basis of the Data Science Machine. "There's so much data out there to be analyzed. And right now it's just sitting there not doing anything. So maybe we can come up with a solution that will at least get us started on it, at least get us moving."
Between the lines
Kanter and his thesis advisor, Kalyan Veeramachaneni, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), describe the Data Science Machine in a paper ...
Veeramachaneni co-leads the Anyscale Learning for All group at CSAIL, which applies machine-learning techniques to practical problems in big-data analysis, such as determining the power-generation capacity of wind-farm sites or predicting which students are at risk for dropping out of online courses.
"What we observed from our experience solving a number of data science problems for industry is that one of the very critical steps is called feature engineering," Veeramachaneni says. "The first thing you have to do is identify what variables to extract from the database or compose, and for that, you have to come up with a lot of ideas."
In predicting dropout, for instance, two crucial indicators proved to be how long before a deadline a student begins working on a problem set and how much time the student spends on the course website relative to his or her classmates. MIT's online-learning platform MITx doesn't record either of those statistics, but it does collect data from which they can be inferred.
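A small sketch of how those two indicators might be inferred from raw event logs, since neither is stored directly. The log format, the 48-hour deadline, and the numbers are hypothetical; the point is only that the features are derived rather than recorded.

# Hypothetical event logs: (student, timestamp_hours, event). Neither indicator
# is stored directly, but both can be derived from events like these.
logs = [
    ("amy", 10.0, "open_pset"), ("amy", 11.5, "page_view"), ("amy", 12.0, "page_view"),
    ("bob", 47.0, "open_pset"), ("bob", 47.5, "page_view"),
]
DEADLINE = 48.0   # hours (hypothetical)

def hours_before_deadline(student):
    starts = [t for s, t, e in logs if s == student and e == "open_pset"]
    return DEADLINE - min(starts)

def relative_site_time(student):
    views = lambda s: sum(1 for st, _, e in logs if st == s and e == "page_view")
    classmates = {s for s, _, _ in logs} - {student}
    avg = sum(views(s) for s in classmates) / len(classmates)
    return views(student) / avg

print(hours_before_deadline("bob"), relative_site_time("bob"))   # -> 1.0 0.5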
Featured composition
Kanter and Veeramachaneni use a couple of tricks to manufacture candidate features for data analyses. One is to exploit structural relationships inherent in database design. Databases typically store different types of data in different tables, indicating the correlations between them using numerical identifiers. The Data Science Machine tracks these correlations, using them as a cue to feature construction.
For instance, one table might list retail items and their costs; another might list items included in individual customers' purchases. The Data Science Machine would begin by importing costs from the first table into the second. Then, taking its cue from the association of several different items in the second table with the same purchase number, it would execute a suite of operations to generate candidate features: total cost per order, average cost per order, minimum cost per order, and so on. As numerical identifiers proliferate across tables, the Data Science Machine layers operations on top of each other, finding minima of averages, averages of sums, and so on.
It also looks for so-called categorical data, which appear to be restricted to a limited range of values, such as days of the week or brand names. It then generates further feature candidates by dividing up existing features across categories.
Once it's produced an array of candidates, it reduces their number by identifying those whose values seem to be correlated. Then it starts testing its reduced set of features on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield.
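A rough pandas sketch of that pipeline: follow the numerical identifiers across tables, layer aggregations to generate candidate features, then prune candidates whose values are highly correlated. This is not the actual Data Science Machine code; the tables are made up, and with only two orders the toy aggregates come out perfectly correlated, so the pruning step keeps just the first one.

import pandas as pd

# Made-up relational tables: items with costs, and purchases referencing them.
items = pd.DataFrame({"item_id": [1, 2, 3], "cost": [4.0, 10.0, 2.5]})
purchases = pd.DataFrame({"order_id": [100, 100, 101, 101, 101],
                          "item_id":  [1, 2, 1, 3, 3]})

# Step 1: follow the numerical identifier (item_id) to import costs.
joined = purchases.merge(items, on="item_id")

# Step 2: layer aggregations per order: total, average, minimum cost.
candidates = joined.groupby("order_id")["cost"].agg(
    total_cost="sum", avg_cost="mean", min_cost="min")

# Step 3: prune candidates whose values are (nearly) perfectly correlated.
corr = candidates.corr().abs()
keep = []
for col in candidates.columns:
    if all(corr.loc[col, k] < 0.95 for k in keep):
        keep.append(col)
print(candidates[keep])   # only one of the three toy features survives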
"The Data Science Machine is one of those unbelievable projects where applying cutting-edge research to solve practical problems opens an entirely new way of looking at the problem," says Margo Seltzer, a professor of computer science at Harvard University who was not involved in the work. "I think what they've done is going to become the standard quickly—very quickly."
•
•
• http://phys.org/news/2015-02-tackles-biggest-bottlenecks-science-industry.html
• Researcher tackles some of the biggest bottlenecks holding back the data science industry
• February 25, 2015 by Eric Brown
•
____________________________________
“... the arrival of AI will not be any more or any less disruptive than the arrival of indoor plumbing, vaccines, the car, air travel, the television, the computer, the internet, etc.”;── Yann LeCun (self.MachineLearning), http://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_yann_lecun
data science/ DS/ machine learning/ ML/ unsupervised feature learning/
unsupervised learning/ computer science/ CS/ science fiction/ SF/ sci-fi/
fantasy/ fa/ fiction/ fi/ Finland/ fi/ reinforcement learning/ RL/
deep learning/ DL/ artificial intelligence/ AI/ expert systems/
representation learning/ RL/
AMA: Yann LeCun (self.MachineLearning)
http://www.reddit.com/r/MachineLearning/comments/25lnbt/ama_yann_lecun
reinforcement learning uses Q-learning (a very classical algorithm for RL)
convolutional network (a now very classical method for image recognition)
The DeepMind video-game player that trains itself with reinforcement learning uses Q-learning (a very classical algorithm for RL) on top of a convolutional network (a now very classical method for image recognition). One of the authors is Koray Kavukcuoglu who is a former student of mine.
<----------------------------------------------------------------->
http://arxiv.org/abs/1312.5602
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
(Submitted on 19 Dec 2013)
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Comments: NIPS Deep Learning Workshop 2013
Subjects: Learning (cs.LG)
in December, DeepMind published a paper showing that its software could do that by learning how to play seven Atari 2600 games using as inputs only the information visible on a video screen, such as the score. For three of the games, the classics Breakout, Enduro, and Pong, the computer ended up playing better than an expert human. It performed less well on Q*bert and Space Invaders, games where the best strategy is less obvious.
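A minimal sketch of tabular Q-learning, the "very classical algorithm" mentioned above, on an invented five-cell corridor with a reward in the rightmost cell. DQN replaces the lookup table with a convolutional network reading raw pixels, but the update rule is the same idea; and because Q-learning is off-policy, acting completely at random during training is enough for the table to converge here.

import random

# Tabular Q-learning on a toy 5-cell corridor (states 0..4, reward at state 4).
ALPHA, GAMMA = 0.5, 0.9
ACTIONS = (-1, +1)                      # step left / step right
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}

def step(s, a):
    s2 = min(4, max(0, s + a))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4   # next state, reward, done

for _ in range(500):                    # Q-learning is off-policy, so acting
    s, done = 0, False                  # at random during training still works
    while not done:
        a = random.choice(ACTIONS)
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])    # Q-learning update
        s = s2

print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)])  # -> [1, 1, 1, 1]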
<----------------------------------------------------------------->
to make sure people like Vladimir could work on their research with minimal friction and distraction.
Deep learning has become the dominant method for acoustic modeling in speech recognition, and is quickly becoming the dominant method for several vision tasks such as object recognition, object detection, and semantic segmentation.
The next frontier for deep learning are language understanding, video, and control/planning (e.g. for robotics or dialog systems).
I believe there is a role to play for specialized hardware for embedded applications. Once every self-driving car or maintenance robot comes with an embedded perception system, it will make sense to build FPGAs, ASICs or have hardware support for running convolutional nets or other models.
"Data Science", which is the automatic (or semi-automatic) extraction of knowledge from data.
Otherwise, the order in which we learn things would not matter. Obviously, the order in which we learn things does matter (that's why pedagogy exists). The famous developmental psychologist Jean Piaget established that children learn simple concepts before learning more complex/abstract ones on top of them.
There are four main uses for unsupervised learning: (1) learning features (or representations); (2) visualization/exploration; (3) compression; (4) synthesis. Only (1) is interesting to me (the other uses are interesting too, just not on my own radar screen).
These are folks who have long been interested in representing data (mostly natural signals like audio and images). These are people who have worked on wavelet transforms, sparse coding and sparse modeling, compressive sensing, manifold learning, numerical optimization, scientific computing, large-scale linear algebra, fast transforms (FFT, fast multipole methods). This community has a lot to say about how to represent data in high-dimensional spaces.
The questions become: how well does my method work on this particular problem, and how large is the set of problems on which it works well?
It's important to keep in mind that the arrival of AI will not be any more or any less disruptive than the arrival of indoor plumbing, vaccines, the car, air travel, the television, the computer, the internet, etc.
establishing causal relationships is a hugely important problem in data science. There are huge applications in healthcare, social policy....
http://code.madbits.com/wiki/doku.php?id=tutorial_basics
For a long time, speech recognition has stagnated because of the dictatorship of results on benchmarks. The barrier to entry was very high, and it was very difficult to get state-of-the-art performance with brand new methods.
There has to be a process by which innovative ideas can be allowed to germinate and develop, and not be shut down before they get a chance to produce good results.
It's very useful for time series prediction. Alex Graves (from Deep Mind) has quite a few nice papers on applying neural networks to time series though most of his work is focused on classification rather than forecasting.
Learning with temporal/sequential signals: language, video, speech.
Marrying deep/representation learning with reasoning or structured prediction.
In the early days of aviation, some people (like Clément Ader) tried to copy birds and bats a little too closely (without understanding the principles of lift, drag, and stability) while others (like the Wright Brothers and Santos-Dumont) had a more systematic engineering approach (building a wind tunnel, testing airfoils, building full-scale gliders....). Both were somewhat inspired by nature, but to different degrees. My problem with sticking too close to nature is that it's like "cargo-cult" science. A bird biologist will tell you how important the micro-structure of feathers is to bird flight. You will think that you need to reproduce feathers in their most minute details to build flying machines. In reality, flight relies on the Bernoulli principle: pushing an angled plate (preferably shaped like an airfoil) through air creates lift. I don't use neural nets because they look like the brain. I use them because they are a convenient way to construct parameterized non-linear functions with good properties. But I did get inspiration from the architecture of the visual cortex to build convolutional nets.
____________________________________
Joshua Cooper Ramo (author), The seventh sense (book), 2016
pp.276-80
Pattie Maes
p.276
When I first met her, in the 1990s, she was in charge of much of the work on artificial intelligence (AI) at MIT's Media Lab, Danny Hillis's old home.
p.276
she introduced me to a puzzle of her field that has stayed on my mind in the years since. It is called the disappearing AI problem.
p.276
Back in the 1990s, ..., Maes and her team were tinkering with what was known as computer-aided prediction.
pp.276-277
Maes intended to design a computer that could ask, for instance, what movie stars you like. “Robert Redford”, you'd type. And then the machine would spit back some films you might enjoy. The Paul Newman classic Cool Hand Luke, for instance.
p.277
And, well, you had liked that film. This seemed magic, just the sort of data-meets-human question that showcased a machine learning and thinking. An honestly artificial intelligence. Maes hoped to design a computer that could predict what movies or music or books you or I might enjoy. (And, of course, buy.)
p.277
A recommendation engine.
p.277
But to confidently bridge your knowledge of a friend's taste and the nearly endless library of movies and songs and books? Beyond human capacity. It seemed an ideal job for a thoughtful machine.
The traditional approach to such a problem was to devise a formula that would mimic your friend. What are his hobbies? What areas interest him? What cheers him up? Then you'd program a machine to jump just as deep into movies and music and books, to break them down by plot and type of character to see what might fit your friend's interests.
p.277
But after years building programs that tried ── and failed ── to tackle the recommendation problem in this fashion, the MIT group changed tack.
p.277
Instead of teaching a machine to understand you (or Tolstoy), they simply began compiling data about what movies and music and books people liked. Then they looked for patterns. People were not, they discovered, all that unique.
p.277
Pretty much everyone who liked Redford in Downhill Racer loved Newman in The Hustler. Anyone who enjoyed Radiohead's Kid A could be directed safely to Sigur Rós's Ágaetis Byrjun.
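A toy sketch of that pattern-matching approach: recommend whatever the people who overlap with your tastes also liked, with no model of the films or songs themselves. The "who liked what" data below is invented.

from collections import Counter

# Invented "who liked what" data, standing in for the lab's compiled logs.
likes = {
    "u1": {"Downhill Racer", "The Hustler", "Kid A"},
    "u2": {"Downhill Racer", "The Hustler"},
    "u3": {"Kid A", "Agaetis Byrjun"},
    "u4": {"Downhill Racer", "The Hustler", "Cool Hand Luke"},
}

def recommend(user, k=2):
    # Count how often other people's items co-occur with this user's items,
    # then suggest the most frequent ones the user hasn't seen yet.
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other != user and mine & theirs:
            scores.update(theirs - mine)
    return [item for item, _ in scores.most_common(k)]

print(recommend("u2"))   # suggests titles u2 hasn't seen, e.g. 'Kid A', 'Cool Hand Luke'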
pp.277-278
Maes and her team found themselves, as a result, less focused on the mechanics of making a machine think than on devising formulas to organize, store, and probe data.
p.278
What had begun as a problem of artificial intelligence became, in the end, a puzzle of mathematics.
p.278
The mystery of human thought, that great, unknowable sea of chemicals and instinct and experience that would have let you place your finger on just the song to open the heart of your date, had been unlocked by data. Here was the disappearing AI problem. A puzzle that looked like it needed computer intelligence demanded, in the end, merely math. The AI had disappeared.
p.278
Many problems that once seemed to demand the miracle of thought really only needed data.
Joshua Cooper Ramo, The seventh sense: power, fortune, and survival in the age of network, 2016.
____________________________________
── Now, with the Aardvark data: Romney insisted the revisions were the result not of systemic errors but of getting more data. He had been relying on historical data of large Soviet nuclear tests and extrapolating down to make estimates about the detection of smaller tests, which might be confused with earthquakes. “The change came about as a result of additional information we got”, Romney insisted., Sharon Weinberger, The imagineers of war : the untold history of DARPA, the pentagon agency that changed the world, 2017, [p.390]
Sharon Weinberger, The imagineers of war : the untold history of DARPA, the pentagon agency that changed the world, 2017
p.102
He had been arguing that it would be difficult to distinguish small underground nuclear tests from earthquakes, which would make verifying a nuclear test ban treaty difficult, if not impossible.
Now, with the Aardvark data, he knew he had been wrong on a key point.
During a July 3, 1962, meeting, Romney announced that the new seismic data led him to conclude that distinguishing between tremors and small nuclear tests might not be as difficult as he had previously thought.
102 Now, with the Aardvark data: Romney insisted the revisions were the result not of systemic errors but of getting more data. He had been relying on historical data of large Soviet nuclear tests and extrapolating down to make estimates about the detection of smaller tests, which might be confused with earthquakes. “The change came about as a result of additional information we got”, Romney insisted. Romney, interview with the author. [p.390]
(The imagineers of war : the untold story of DARPA, the Pentagon agency that changed the world / by Sharon Weinberger., New York : Alfred A. Knopf, 2017, united states. defense advanced research projects agency──history. | military research──united states. | military art and science──technological innovations──united states. | science and state──united states. | national security──united states──history. | united states──defenses──history., U394.A75 W45 2016 (print) | U394.A75 (ebook) | 355/.040973, 2017, )
Sharon Weinberger, The imagineers of war : the untold history of DARPA, the pentagon agency that changed the world, 2017
pp.99-104
p.99
ARPA was assigned nuclear test detection under the code name Vela at the end of 1959 as a counterweight to the CIA's and the air force's secret test detection network. ARPA got the work, quite simply, because President Eisenhower did not trust his spooks and wanted an assessment that was independent of the CIA and its assets.
p.99
brought renewed focus and funding to the Vela test detection program.
By 1961, Vela had three parts:
Vela Uniform, to detect underground nuclear tests;
Vela Sierra, to detect nuclear explosions in the atmosphere; and
Vela Hotel, which would launch satellites with sensors to detect nuclear tests from space.
99 Vela had three parts: The two most significant parts of Vela ended up being Vela Hotel and Vela Uniform. Vela Sierra, which involved ground-based sensors to detect nuclear tests in space, was eventually folded into Vela Hotel. Some of the Vela work, it turns out, did not really require any exotic science. For example, detecting underwater explosions required little new research. ARPA conducted some underwater tests using conventional explosives under the code name CHASE, short for “cut holes and sink 'em”. Huff and Sharp, Advanced Research Projects Agency, VII-15. “The ocean detection system was a nonproblem”, Frosch said. Frosch, interview with author. [p.390]
p.99
The academic discipline of seismology, at the time, was a backwater. Robert Frosch, who was recruited to ARPA to run Vela, recalled going with the director, Robert Sproull, to visit what was supposed to be a state-of-the-art seismic vault, one of the underground bunker-like structures that were used to measure tremors. The two men came out of the vault in shock, feeling as if they had just emerged from a time capsule. The seismologists there were using pen recorders and primitive galvanometers, an analog instrument used to measure electrical current.
p.99
Vela began to change that with an influx of funding for seismology that was almost unimaginable in scale for most areas of science. The military's need to distinguish earthquakes from nuclear tests brought seismology “kicking and screaming” into the 20th century, according to Frosch. At one point, he said, he funded almost “every seismologist in the world, except for two Jesuits at Fordham University” who refused to take money from the Pentagon.
p.100
Large Aperture Seismic Array, or LASA,
a massive nuclear detection system that comprised 200 “seismic vaults” buried across a 200-kilometer-diameter area in the eastern half of Montana. For it to work, more than a dozen of these enormous sites would have to be constructed around the world to monitor the Soviet Union.
There had been smaller arrays, including one in the United Kingdom,
The air force hated the idea,
p.100
Billings, Montana
What was amazing about LASA, according to Frosch, was the scale of the work, which was completed in just 18 months, a schedule unimaginable for government projects that typically take years, if not decades.
When ARPA needed to have a center where all the seismic data could be collected and analyzed, the agency ended up renting space in downtown Billings, where data from the array was routed to an IBM computer.
p.100
ARPA also began funding the placement of seismograph stations around the world that were operated by scientists.
pp.100-101
the CIA and the air force, who up to that point had a monopoly on advice to political leaders about what was theoretically possible to monitor a [nuclear explosion] test ban.
p.101
local scientists only needed to agree to operate them and share the data.
p.101
a growing tension between secret and open research
p.102
air force and the CIA refused to release data from their network of sensors.
bête noire - Fr. Anything that is an object of hate or dread; a bugaboo. [< F, black beast]
p.102
The bête noire of the nuclear detection world was Carl Romney, a scientist who worked for the Air Force Technical Application Center, or AFTAC, the agency responsible for nuclear test detection.
p.102
Whether deliberate or not, the problem with secret data, as Ruina pointed out, was that “nobody could argue with it; they could just question it.” The secret data problem came to a head in 1962, when the United States carried out a test called Aardvark, a part of the first series of tests conducted completely underground.
p.102
Aardvark, a 40-kiloton nuclear device intended for nuclear artillery, produced reliable seismographic data on a nuclear underground explosion, and Romney suddenly realized he had been wrong about a critical national security issue.
p.102
He had been arguing that it would be difficult to distinguish small underground nuclear tests from earthquakes, which would make verifying a nuclear test ban treaty difficult, if not impossible.
Now, with the Aardvark data, he knew he had been wrong on a key point.
During a July 3, 1962, meeting, Romney announced that the new seismic data led him to conclude that distinguishing between tremors and small nuclear tests might not be as difficult as he had previously thought.
102 Now, with the Aardvark data: Romney insisted the revisions were the result not of systemic errors but of getting more data. He had been relying on historical data of large Soviet nuclear tests and extrapolating down to make estimates about the detection of smaller tests, which might be confused with earthquakes. “The change came about as a result of additional information we got”, Romney insisted. Romney, interview with the author. [p.390]
p.102
it would look as if the government were “withholding information that would tend to ease the inspection problem in a nuclear test ban.”
pp.102-103
Ruina called it an “honest mistake”, but one that would have been avoided if other scientists had been given access to the classified data that Romney jealously guarded. “This is what can happen when you have one person interpreting data, there's no peer group reviewing it, and there's nobody duplicating the experiment”, the ARPA director wrote in a three-page letter, blaming the mistake on secrecy.
p.103
Glenn Seaborg, chairman of the Atomic Energy Commission
played a key role in test ban negotiations.
“VELA seemed to indicate that the detection capability was better than had been thought by American experts in the period from 1959 to 1961”, Seaborg wrote in his memoir detailing the negotiations.
(The imagineers of war : the untold story of DARPA, the Pentagon agency that changed the world / by Sharon Weinberger., New York : Alfred A. Knopf, 2017, united states. defense advanced research projects agency──history. | military research──united states. | military art and science──technological innovations──united states. | science and state──united states. | national security──united states──history. | united states──defenses──history., U394.A75 W45 2016 (print) | U394.A75 (ebook) | 355/.040973, 2017, )
____________________________________
Joshua Cooper Ramo (author), The seventh sense (book), 2016
p.279
You and I might be able to spot patterns in movie habits, given enough time, but as more complex problems emerge, as a world of a trillion connected points becomes a sea of data to examine, there is no chance we'll match the machines.
pp.282-283
• predictive learning (AI systems design) and
• representation learning (AI systems design)
The AI systems designer Roger Grosse has named two paths to this sort of wired sensibility: predictive learning and representation learning. That first approach is what Maes's movie machine pursued. The computer is simply checking what it encounters against a database. It teaches itself to predict based on what has been seen before. This sort of knowledge begins with massive amounts of data and then hunts for patterns, tests their reliability, and improves by mapping quirks and similarities.
p.283
Google engineers have a device that can gaze into a human eye and spot signs of impending optical failure. Is the machine smarter than your ophthalmologist? Hard to know, but let's just say this: It has seen, studied, and compared millions of eyes to find patterns that nearly perfectly predict a diagnosis. It can review in seconds more cases than your doctor will see in a lifetime ── let alone recall and compare at submillimeter accuracy. Fast, thorough predictive algorithms make what might once have been regarded as AI disappear. The machine isn't all that wise; it just knows a lot.
p.283
On the other path, the one of representation learning, the machine uses a self-sketched image of the world, a “representation”. Say you wanted a computer to identify restaurants with outdoor seating. A predictive system might be told, Look for pictures in which a third of the pixels are sky colored. You can see how such a primitive approach might be limited. But a representation-based program would use a neural network to examine thousands of photos ── such a collection is called “training data” ── of restaurant patios. It would develop its own sense of what makes these images special: sunlight glinting off glasses, sky reflected in silverware. It would assemble, bit by bit, an accurate feeling for the features of an outdoor dining space. And over time, it could aspire to near-perfect fidelity.
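A toy version of the hand-written predictive rule the passage describes, applied to fake 2x2 "photos" of RGB pixels. The "sky-coloured" test and the one-third threshold are exactly the kind of brittle heuristic that a representation-based program would instead learn for itself from labelled training data.

# Toy version of the hand-coded predictive rule (not a learned representation).
def is_skyish(pixel):
    r, g, b = pixel
    return b > 150 and b > r and b > g          # crude "sky-coloured" test

def predict_outdoor_by_rule(photo, threshold=1 / 3):
    pixels = [p for row in photo for p in row]
    sky = sum(is_skyish(p) for p in pixels)
    return sky / len(pixels) >= threshold       # "a third of the pixels are sky"

patio = [[(120, 180, 230), (110, 170, 225)],
         [( 90,  60,  40), (200, 190, 180)]]    # half sky, half table (invented)
print(predict_outdoor_by_rule(patio))           # -> True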
p.284
Faces, disease markers, obscure sounds
p.284
Today, basic versions of representational AI can study a map and name the most important roads. They can predict cracks in computer networks days before a fault. Representation-based programs take longer to train, as you might expect. But these training times are getting shorter. And though representational AIs are harder to program ── and they demand almost unimaginable amounts of computing power ([ and unimaginable amounts of [label?] data to reach the degree of accuracy and reliability to make the program practical ]) ── they produce a subtle, lively kind of insight.
p.284
A machine with a prediction-based understanding of classical music can listen to a clip of a symphony and name it. One with a representation-based understanding of, say, Mozart's forty-one symphonies can write you an extremely convincing forty-second symphony ── or, if you wish, an even earlier First Symphony, based on what it knows about Mozart's evolution as a composer. It can do it again and again. In seconds.
Joshua Cooper Ramo, The seventh sense: power, fortune, and survival in the age of network, 2016.
____________________________________
Albert-László Barabási, BURSTS, 2010 [ ]
[pp.171-172]
19
the patterns of human mobility
... About a year after the publication of my first book on networks I had grown used to e-mails and calls from readers seeking advice on inter-connected systems. This was one of the few times that someone had called not to ask but to give. He had my full attention.
The caller was a high-ranking executive at a mobile-phone consortium who'd recognized the value in having records of who is talking with whom. After reading 'Linked' he had become convinced that social networking was essential to improving services for his consumers. So he offered access to their anonymized data in exchange for any insights our research group might provide.
His intuition proved correct: My group and I soon found the mobile users' behavior patterns to be so deeply affected by the underlying social network that the executive ordered many of his company's business practices redesigned, from marketing to consumer retention. With that, he pioneered a trend that over the past few years has swept most mobile carriers, triggering an avalanche of research into mobile communications. Despite his crucial role in advancing network thinking in the mobile industry, his combination of modesty and caution prevented his ever wanting his name attached to any of it.
As my group and I immersed ourselves in the intricacies of mobile communications, we came to understand that mobile phones not only reveal who our friends are but also capture our whereabouts. Indeed, each time we make a call the carrier records the tower that communicates with our phone, effectively pinpointing our location. This information is not terribly accurate, as we could be anywhere within the tower's reception area, which can span tens of square miles. Furthermore, our location is usually recorded only when we use our phone, providing ... information about our whereabouts between calls.
Despite these constraints, the data offered an exceptional opportunity to explore the mobility of millions of individuals.
(Barabási, Albert-László; 'BURSTS: the hidden pattern behind everything we do', copyright © 2010, 303.4901 Barabási, )
(BURSTS by Albert-László Barabási, © 2010, 303.4901 Barabási, pp.171-172)
[pp.193-195]
... As a result we tend to romanticize college life, the cradle of youth culture, seeing students as perhaps the most spontaneous and thus least predictable segment of the population. Yet Sandy Pentland, an MIT professor who follows the chatter of hundreds of students every day, finds that concept preposterous.
In the early 1990s Pentland started a research program in wearable computing at the Media Lab at MIT, prompted by the realization that, given the rate at which computers were shrinking, we soon would want to have them with us all the time. Sandy's vision of the future proved remarkably accurate, as today computers have become a part of our wardrobe, fashion accessories of a kind. In fact, for the most part we have stopped even calling them computers. We refer to them simply as smart phones.
In the fall of 2002 Nathan Eagle, a doctoral student in Sandy's lab, offered one hundred MIT students free Nokia smart phones, a desirable top-of-the-line gadget at the time. This was no handout, however; the catch was that the phones collected everything they could about their owners: whom they called and when, how long they chatted, where they were, and who was nearby. By the end of the year-long experiment, Nathan Eagle and Sandy Pentland had collected about 450,000 hours of data on the communication, whereabouts, and behavior of seventy-five Media Lab faculty and students and twenty-five freshmen from MIT's Sloan School of Management.
Trying to make sense of his data, Nathan arranged each student's whereabouts into three groups: home, work, and "elsewhere," the latter category assigned when they were neither at home nor at work but jogging along the Charles River or partying at a friend's house. Then he developed an algorithm to detect repetitive patterns, quickly discovering that on weekdays the students were mainly at home between the hours of ten P.M. and seven A.M. and at the university between ten A.M. and eight P.M. Their behavior changed slightly only during the weekends, when they showed an inclination to stay home as late as ten A.M.
None of these patterns would shock anybody familiar with graduate student life. But the level of predictability of their routines was still remarkable. Nathan found that if he knew a business-school student's morning location he could predict with 90 percent accuracy the student's afternoon whereabouts. And for Media Lab students, the algorithm did even better, predicting their whereabouts 96 percent of the time. ([ we are creatures of habits ])
It is tempting to see life as a crusade against randomness, a yearning for a safe, ordered existence. If so, the students excelled at it, ignoring the roll of the dice day after day. Indeed, Nathan's algorithm failed to predict their whereabouts only twice a week, during rare hours of rebellion when they finally lived up to our expectation that they be wild and spontaneous. Yet the timing of these unpredictable moments was by no means random--they were the typical party times, the Friday and Saturday nights. The rest of the week, twenty-two out of twenty-four hours a day, the students were neither the elusive Osama bin Laden nor the ubiquitously erratic Britney Spears but instead dutifully trod the deeply worn grooves of their lives. So maybe the Harlequins were onto something when they insisted on using an RNG (random number generator). Had they studied at MIT, their whereabouts would have been no mystery--not to Nathan, nor to the Vast Machine.
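A toy version of that kind of prediction: for each morning location, remember the most common afternoon location and predict it. The daily logs below are invented, and the real analysis was far richer, but the mechanics are roughly this simple.

from collections import Counter, defaultdict

# Invented daily logs: (morning_location, afternoon_location), one pair per day.
days = [("home", "lab"), ("home", "lab"), ("home", "lab"),
        ("home", "elsewhere"), ("lab", "lab"), ("lab", "lab")]

# Learn, for each morning location, the most common afternoon location.
table = defaultdict(Counter)
for morning, afternoon in days:
    table[morning][afternoon] += 1

def predict_afternoon(morning):
    return table[morning].most_common(1)[0][0]

hits = sum(predict_afternoon(m) == a for m, a in days)
print(predict_afternoon("home"), f"{hits / len(days):.0%} accurate on these days")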
But we may yet avert the dawn of an Orwellian world as described in 'The Traveler'. For me, this sense of hopefulness emerged in the summer of 2007 when I purchased a brick-sized wristwatch. It was a loud antifashion statement and doubled as a GPS device, which recorded my precise location every few seconds. After I had worn it for several months, Zehui Qu, a visiting computer-science student, applied Nathan Eagle and Sandy Pentland's predictive algorithm to the data collected by my GPS. Sure enough, after a few days of training, Qu was able to predict my whereabouts with 80 percent accuracy.
While the algorithm's performance was impressive, the persistent gap between the 96 percent predictability Nathan found among the MIT students and my 80 percent raised a red flag. Neither I nor the MIT students were a fair representation of the population at large. Marta's study of the mobile-phone records had already explained why: When it comes to our travel patterns, we are hugely different. Some, like the MIT students and myself, are relatively home- and office-bound. Others are outliers, however, and travel a lot, tending to be less localized.
So does that mean there are people out there who are far less predictable than the MIT students and I? Truck drivers, perhaps, who travel the country for weeks at a time? Soccer moms, whose minivans shuttle between piano and fencing lessons? What about super-traveler Hasan Elahi, whose "suspicious movements" will undoubtedly land him in hot water again? How different are they from you and me? Are there Harlequins among us, individuals whose lives are driven by the roll of the dice to such a degree that their movements are impossible to foresee?
* The difference between human dynamics and data-mining boils down to this: Data mining predicts our behaviors based on records of our patterns of activity; we don't even have to understand the origins of the patterns exploited by the algorithm. Students of human dynamics, on the other hand, seek to develop models and theories to explain why, when, and where we do the things we do with some regularity.
(Barabási, Albert-László; 'BURSTS: the hidden pattern behind everything we do', © 2010, 303.4901 Barabási, pp.193-195)
____________________________________
„Machine learning is a mathematical technique for training computer systems to make accurate predictions from a large corpus of training data, with a degree of accuracy that in some domains can mimic human cognition.“
—— Maciej Ceglowski,
May 7, 2019,
US Senate Committee on Banking, Housing, and Urban Affairs
on Privacy Rights and Data Collection in a Digital Economy
<< long read - scroll down to skip this section >>
Maciej Ceglowski's Senate testimony on Privacy Rights and Data Collection in a Digital Economy
May 7, 2019,
Senate Committee on Banking, Housing, and Urban Affairs
Privacy Rights and Data Collection in a Digital Economy (Senate hearing)
privacy
pinboard
regulation
gdpr
long read
https://idlewords.com/talks/senate_testimony.2019.5.htm
Consent in a world of inference
For example, imagine that an algorithm could inspect your online purchasing history and, with high confidence, infer that you suffer from an anxiety disorder. Ordinarily, this kind of sensitive medical information would be protected by HIPAA, but is the inference similarly protected? What if the algorithm is only reasonably certain? What if the algorithm knows that you’re healthy now, but will suffer from such a disorder in the future?
The question is not hypothetical—a 2017 study showed that a machine learning algorithm examining photos posted to the image-sharing site Instagram was able to detect signs of depression before it was diagnosed in the subjects, and outperformed medical doctors on the task.
Addendum: Machine Learning and Privacy
Machine learning is a mathematical technique for training computer systems to make accurate predictions from a large corpus of training data, with a degree of accuracy that in some domains can mimic human cognition.
For example, machine learning algorithms trained on a sufficiently large data set can learn to identify objects in photographs with a high degree of accuracy, transcribe spoken language to text, translate texts between languages, or flag anomalous behavior on a surveillance videotape.
The mathematical techniques underpinning machine learning, like convolutional neural networks (CNN), have been well-known since before the revolution in machine learning that took place beginning in 2012. What enabled the key breakthrough in machine learning was the arrival of truly large collections of data, along with concomitant [accompanies or is collaterally connected with] computing power, allowing these techniques to finally demonstrate their full potential.
It takes data sets of millions or billions of items, along with considerable computing power, to get adequate results from a machine learning algorithm. Before the advent of the surveillance economy, we simply did not realize the power of these techniques when applied at scale.
Because machine learning has a voracious appetite for data and computing power, it contributes both to the centralizing tendency that has consolidated the tech industry, and to the pressure companies face to maximize the collection of user data.
Machine learning models pose some unique problems in privacy regulation because of the way they can obscure the links between the data used to train them and their ultimate behavior.
A key feature of machine learning is that it occurs in separable phases. An initial training phase consists of running a learning algorithm on a large collection of labeled data (a time and computation-intensive process). This model can then be deployed in an exploitation phase, which requires far fewer resources.
Once the training phase is complete, the data used to train the model is no longer required and can conceivably be thrown away.
The two phases of training and exploitation can occur far away from each other both in space and time. The legal status of models trained on personal data under privacy laws like the GDPR, or whether data transfer laws apply to moving a trained model across jurisdictions, is not clear.
Inspecting a trained model reveals nothing about the data that went into it. To a human inspecting it, the model consists of millions and millions of numeric weights that have no obvious meaning, or relationship to human categories of thought. One cannot examine an image recognition model, for example, and point to the numbers that encode ‘apple’.
The training process behaves as a kind of one-way function. It is not possible to run a trained model backwards to reconstruct the input data; nor is it possible to “untrain” a model so that it will forget a specific part of its input.
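To make the two phases concrete, here is a small illustrative sketch in Python using scikit-learn and synthetic data (both are assumptions of this example, not part of the testimony): the model is trained once on a large array, saved, and later reloaded for cheap predictions; inspecting it shows only numeric weights.
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression
# Training phase: needs the full data set and most of the computation.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20))            # stand-in for a large corpus
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# The training data can now be discarded; only the fitted weights are kept.
joblib.dump(model, "model.joblib")
# Exploitation phase: may run later, elsewhere, with far fewer resources.
deployed = joblib.load("model.joblib")
print(deployed.predict(rng.normal(size=(1, 20))))
# Inspecting the trained model reveals only numeric weights with no obvious
# relationship to the individuals whose data trained it.
print(deployed.coef_)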
Machine learning algorithms are best understood as inference engines. They find structure and excel at making inferences from data that can sometimes be surprising even to people familiar with the technology. This ability to see patterns that humans don’t notice has led to interest in using machine learning algorithms in medical diagnosis, evaluating insurance risk, assigning credit scores, stock trading, and other fields that currently rely on expert human analysis.
The opacity of machine learning models, combined with this capacity for inference, also make them an ideal technology for circumventing legal protections on data use. In this spirit, I have previously referred to machine learning as “money laundering for bias”. Whatever latent biases are in the training data, whether or not they are apparent to humans, and whether or not attempts are made to remove them from the data set, will be reflected in the behavior of the model.
A final feature of machine learning is that it is curiously vulnerable to adversarial inputs. For example, an image classifier that correctly identifies a picture of a horse might reclassify the same image as an apple, sailboat or any other object of an attacker’s choosing if they can manipulate even one pixel in the image. Changes in input data not noticeable to a human observer will be sufficient to persuade the model. Recent research suggests that this property is an inherent and ineradicable feature of any machine learning system that uses current approaches.
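A toy illustration of such an adversarial input, in Python with NumPy, against a hypothetical linear classifier; the weights, the input, and the step size are all assumptions of this sketch, and real attacks on image models work on the same principle at much larger scale.
import numpy as np
rng = np.random.default_rng(1)
w = rng.normal(size=100)                       # weights of a hypothetical trained linear classifier
score = lambda v: float(v @ w)
x = rng.normal(size=100)
x -= ((score(x) - 2.0) / (w @ w)) * w          # adjust the input so the clean score is a modest +2.0
print("clean prediction positive:", score(x) > 0)              # True
eps = 0.05                                     # per-feature change, tiny next to typical feature values ~1
x_adv = x - eps * np.sign(w)                   # step each feature slightly against the weights
print("adversarial prediction positive:", score(x_adv) > 0)    # False: the predicted label flips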
In brief, machine learning is effective, has an enormous appetite for data, requires large computational resources, makes decisions that resist analysis, excels at finding latent structure in data, obscures the link between source data and outcomes, defies many human intuitions, and is readily fooled by a knowledgeable adversary.
—Maciej Ceglowski, 2019
source:
https://tildes.net/~tech
____________________________________
Copy (cut) and Paste Text #RLA-POST
By Russell L. Ackoff
post industrial revolution
pp.24-25
The conversion of the Industrial Revolution into what has come to be called the Post industrial Revolution has its origins in the last century. Scientists who explored the use of electricity as a source of energy found that it could not be observed easily. Therefore, they developed such instruments as the ammeter, ohmmeter, and voltmeter to observe IT for them. The development of instruments exploded in this century, particularly after the advent of electronics and sonar and radar. Look at the dashboard of a large commercial airplane, or even one in an automobile. These instruments GENERATE SYMBOLS that represent the properties of objects or events. Such symbols are called DATA. Instruments, therefore, are observing devices, but they are not machines in the Machine-Age sense because they do not apply energy to matter in order to transform it. The technology of instrumentation is fundamentally different from that of mechanization.
Another technology with this same characteristic emerged when the telegraph was invented in the last century. It was followed by the telephone, wireless, radio, television, and so on. This technology, like that of instrumentation, has nothing to do with mechanization; it has to do with the TRANSMISSION OF SYMBOLS, or COMMUNICATION.
The technologies of observation and communication formed the two sides of a technological arch that could not carry any weight until a keystone was dropped into place. This did not occur until the 1940s when the computer was developed. It too did no work in the Machine-Age sense; it manipulated SYMBOLS logically, which, as John Dewey pointed out, is the nature of THOUGHT. It is for this reason that the computer is often referred to as a thinking machine.
Because the computer appeared at a time when we had begun to put things back together again, and because the technologies of observation, communication, and computation all involve the manipulation of symbols, people began to consider systems that combine these three functions. They found that such systems could be used to control other systems, to automate. Automation is fundamentally different from mechanization. Mechanization has to do with the replacement of MUSCLE; automation with the replacement of MIND. Automation is to the Post industrial Revolution what mechanization was to the Industrial Revolution.
Automations are certainly not machines in the Machine-Age sense, and they need not be purposeless. It was for this reason that they came to be called teleological mechanisms. However, automation is no more an essential ingredient of the systems approach than is high technology in general. Both come with the System Age and are among its producers as well as its products. The technology of the Post industrial Revolution is neither a panacea nor a plague; it is what we make of it. It generates a host of problems and possibilities that systems thinking must address. The problems it generates are highly infectious, particularly to less-technologically developed cultures. The systems approach provides a more effective way than previously has been available for dealing with both the problems and the possibilities generated by the Post industrial Revolution, but it is by no means limited to this special set of either or both.
(Ackoff's best : his classic writings on management, Russell L. Ackoff., © 1999, hardcover, John Wiley & Sons, Inc., pp.24-25)
____________________________________
drones
fixed site
mobile drones
space
ground (automobile drone, like Knight Rider (television series))
ground (street)
ground (sidewalk)
ground (tree climbing)
ground (legs)
ground (extreme cold)
under ground
water (surface)
water (under water)
water (ocean, extreme depth)
water (ocean, storm condition)
air (high altitude)
air (ground hugging)
air (aircraft)
mobile drones (robots) that can recharge themselves
motors
autonomous
semi-autonomous
remote control (human)
remote control
remote control toys
remote control cars
remote control boat
remote control aircraft
robotic arm
rifles, hand gun, scope
cameras
camera phone
video phone
microphone, speaker
wheel, track
battery power (electricity)
mobile phone
communication
remote control
frequency hopping
time division multiplexing
adaptive frequency hopping
used mobile phone (computing power)
leverage mobile communication infrastructure
mobile phone as a remote control computing platform
RADAR jammer
digital radio frequency memory chip
DRFM jammer
playstation (video gaming computing machine)
general purpose platform
application specific platform
navigation, communication
you needed ways to navigate and communicate.
____________________________________
Palo Alto : a history of California, capitalism, and the world
by Malcolm Harris
p.186
During the interwar period, a plane all by itself wasn't much more than a toy; to do anything purposeful with it, you needed ways to navigate and communicate. From the beginning, planes relied on ground and onboard electronics systems to guide them. These systems were collectively called avionics.
Palo Alto : a history of California, capitalism, and the world
by Malcolm Harris
____________________________________
Nassim Nicholas Taleb, Fooled by Randomness, 2nd edition, hardcover, 2004 [ ]
ergodicity, 57-58, 96, 156-57, 254
p.96
on average, animals will be fit, but not every single one of them, and not at all times.
Just as an animal could have survived because its sample path was lucky, the “best” operators may have survived because of overfitness to a sample path ── a sample path that was free of the evolutionary rare event.
One vicious attribute is that the longer these animals can go without encountering the rare event, the more vulnerable they will be to it.
We said that should one extend time to infinity, then, by ergodicity, that event will happen with certainty ── the species will be wiped out!
For evolution means fitness to one and only one time series, not the average of all the possible environments.
(Taleb, Nassim (2004)., Fooled by Randomness, 2nd edition, hardcover)
(Fooled by Randomness: the hidden role of chance in life and in the markets / Nassim Nicholas Taleb, 1. investments, 2. chance, 3. random variables, 123.3 Taleb, )
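A quick numeric illustration of this point, in Python (the 1 percent per-period probability is an arbitrary assumption): if a ruinous rare event has any fixed chance p of occurring in each period, the probability of never meeting it over n periods is (1 - p)**n, which falls toward zero as n grows.
p = 0.01                        # assumed per-period chance of the rare event
for n in (10, 100, 1_000, 10_000):
    print(n, (1 - p) ** n)      # survival probability shrinks toward zero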
____________________________________
• most medical doctors are trained to look for strong features when making a diagnosis, because ...
• if you can overcome the Garbage In, Garbage Out (GIGO) problem; the machine learning (also referred to as Artificial Intelligence [AI] in mainstream articles) algorithm that has been trained to detect a specific type of cancer would look at the strong and the weak features ...
Kai-Fu Lee., AI superpowers: China, Silicon Valley and the new world order, 2018
pp.190-191
My first doctor classified the disease as stage IV, the cancer's most advanced stage. On average, patients with 4th-stage lymphoma of my type have around a 50 percent shot at surviving the next five years. I wanted to get a second opinion before beginning treatment, and a friend of mine arranged for me to consult his family doctor, the top hematology practitioner in Taiwan.
It would be a week before I could see that doctor, and in the meantime I continued to conduct my own research on the disease.
p.190
But as a trained scientist whose life hung in the balance, I couldn't help trying to better understand the disease and quantify my chances of survival.
p.190
lymphoma: possible causes, cutting-edge treatment, and long-term survival rates. Through my reading, I came to understand how doctors classify the various stages of lymphoma.
pp.190-191
Medical textbooks use the concept of “stages” to describe how advanced cancerous tumors are, with later stages generally corresponding to lower survival rates. In lymphoma, the stage has traditionally been assigned on the basis of a few straightforward characteristics: Has the cancer affected more than one lymph node? Are the cancerous lymph nodes both above and below the diaphragm (the bottom of the rib cage)? Is the cancer found in organs outside the lymphatic system or in the patient's bone marrow? Traditionally, each answer of “yes” to one of the above questions bumps the diagnosis up a stage. The fact that my lymphoma had affected over twenty sites, had spread above and below my diaphragm, and had entered an organ outside the lymphatic system meant that I was automatically categorized as a stage IV patient.
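The staging logic described here is simple enough to write out directly; the sketch below (Python, an illustration only, not a clinical rule) starts at stage I and bumps the stage for each "yes" answer.
def lymphoma_stage(multiple_nodes, both_sides_of_diaphragm, outside_lymphatic_or_marrow):
    # Start at stage I and add one stage per "yes" answer to the questions above.
    return 1 + sum([multiple_nodes, both_sides_of_diaphragm, outside_lymphatic_or_marrow])
# The case described: many affected nodes, spread above and below the diaphragm,
# and an organ outside the lymphatic system, which gives stage IV.
print(lymphoma_stage(True, True, True))   # 4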
p.191
But what I didn't know at the time of diagnosis was that this crude method of staging has more to do with what medical students can memorize than what modern medicine can cure.
p.191
Ranking stages based on such simple characteristics of a complex disease is a classic example of the human need to base decisions on “strong features”. Humans are extremely limited in their ability to discern correlations between variables, so we look for guidance in a handful of the most obvious signifiers. In making bank loans, for example, these “strong features” include the borrower's income, the value of the home, and the credit score. In lymphoma staging, they simply include the number and location of the tumors.
p.191
These so-called strong features really don't represent the most accurate tools for making a nuanced prognosis, but they're simple enough for a medical system in which knowledge must be passed down, stored, and retrieved in the brains of human doctors.
p.191
Medical research has since identified dozens of other characteristics of lymphoma cases that make for better predictors of five-year survival in patients. But memorizing the complex correlations and precise probabilities of all these predictors is more than even the best medical students can handle. As a result, most doctors don't usually incorporate these other predictors into their own staging decisions.
p.191
In the depths of my own research, I found a research paper that did quantify the predictive power of these alternate metrics. The paper is from a team of researchers at the University of Modena and Reggio Emilia in Italy, and it analyzed fifteen (15) different variables, identifying the five (5) features that, considered together, most strongly correlated to five-year survival.
pp.191-192
These features included some traditional measures (such as bone marrow involvement) but also less intuitive measures (are any tumors over 6 cm in diameter? Are hemoglobin levels below 12 grams per deciliter? Is the patient over 60?). The paper then provides average survival rates based on how many of those features a patient exhibited.
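A sketch of this kind of count-the-risk-factors rubric in Python; the thresholds follow the passage, but the mapping from the count to a five-year survival rate is in the cited paper and is not reproduced here.
def risk_factor_count(age, largest_node_cm, bone_marrow_involved,
                      beta2_microglobulin_elevated, hemoglobin_g_dl):
    # Count how many of the five features from the passage the patient exhibits.
    factors = [
        age > 60,
        largest_node_cm > 6,
        bone_marrow_involved,
        beta2_microglobulin_elevated,
        hemoglobin_g_dl < 12,
    ]
    return sum(factors)
# A hypothetical patient exhibiting only one of the five features.
print(risk_factor_count(age=55, largest_node_cm=4, bone_marrow_involved=False,
                        beta2_microglobulin_elevated=True, hemoglobin_g_dl=13.5))  # 1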
p.192
this new decision rubric still seemed far from rigorous.
But it also showed that the standard staging metrics were very poor predictors of outcomes and had been created largely to give medical students something they could easily memorize and regurgitate on their tests. The new rubric was far more data-driven, and I leaped at the chance to quantify my own illness by it.
p.192
my age, diameter of largest involved node, bone-marrow involvement, β2-microglobulin status, and hemoglobin levels. Of the five features most strongly correlated to early death, it appeared that I exhibited only one.
my risk factors and survival rate.
(AI superpowers: China, Silicon Valley and the new world order / Kai-Fu Lee.; Boston: Houghton Mifflin Harcourt, 2018; includes bibliographical references and index; subjects: artificial intelligence ── economic aspects ── China. | artificial intelligence ── economic aspects ── United States.; HC79.155 (ebook)
HC79.155 L435 2018 (print); 338.4; https://lccn.loc.gov/2018-17250; 2018, )
____________________________________
Michael Lewis, The undoing project, 2017
p.228
Amos Tversky and Don Redelmeier
“Discrepancy between Medical Decisions for Individual Patients and for Groups”, April 1990
p.228
“Physicians deal with patients one at a time, whereas health policy makers deal with aggregates.”
But there was a conflict between the two roles.
(Michael Lewis, The undoing project, 2017, )
____________________________________
Executive summary of ‘expert, The expert’
• experts are not perfect; they make mistakes, because there is always a degree of uncertainty, no matter how close to zero that uncertainty might be; however, most experts and public figures do not want or like to admit that uncertainty exists, because to admit to uncertainty is to admit to the possibility of being wrong, in other words 'error'; and errors usually have consequences;
• many times, experts do not follow the decision-making process (the set of rules or factors that they use to come to a decision or conclusion) that they would tell you they use in practice; in other words, they say one thing, but in practice they do something else, a bit different (the words do not match up with the actions);
• just like the rest of us, experts do make mistakes, and they tend to make the same mistakes over and over again; specifically, experts make the kinds of mistakes that are built into the design and structure of the system; not only that, these mistakes are usually hidden or invisible; and just like the rest of us, experts do not like to admit that they made a mistake; they might attribute the mistake or error to randomness, which is another way of saying they don't really know why the mistake or error happened;
• this is not to say we should not listen to the experts;
• we want the experts to explain, to frame the information, to create a public mental model, to teach, to give illustrative, meaningful, relatable, practical examples, to tell stories, to create an understanding, maybe even multiple understandings, to dispel misunderstanding, to forewarn of pitfalls; ...
• there is no such thing as a public mental model; people have mental models; the public is an abstraction - a potentially helpful, useful, fictional creation;
• the person has a mental model
• a mental model probably can be determined within a team, to a degree
• however, a public mental model is a label I made up by combining: (public) + (mental model) := (public mental model)
____________________________________
Michael Lewis, The undoing project, 2017 [ ]
p.171
Goldberg said he preferred to start simple and build from there. As his first case study, he used the way doctors diagnosed cancer.
pp.171-172
They had found a gaggle of radiologists at the University of Oregon and asked them: How do you decide from a stomach X-ray if a person has cancer? The doctors said that there were seven (7) major signs they looked for: the size of the ulcer, the shape of its borders, the width of the crater it made, and so on. The “cues”, Goldberg called them, as Hoffman had before him.
p.172
Goldberg pointed out that, indeed, experts tended to describe their thought processes as subtle and complicated and difficult to model.
p.172
The Oregon researchers began by creating, as a starting point, a very simple algorithm, in which the likelihood that an ulcer was malignant depended on the seven (7) factors the doctors had mentioned, equally weighted.
p.172
96 different individual stomach ulcers, on a 7-point scale from “definitely malignant” to “definitely benign”. Without telling the doctors what they were up to, they showed them each ulcer twice, mixing up the duplicates randomly in the pile so the doctors wouldn't notice they were being asked to diagnose the exact same ulcer they had already diagnosed.
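A minimal sketch of such an equally weighted cue model in Python: the malignancy score is simply the average of the seven cue ratings. Only three cue names are quoted above; the rest are hypothetical placeholders, as is the assumption that higher ratings mean more malignant.
CUES = ["ulcer_size", "border_shape", "crater_width",
        "cue_4", "cue_5", "cue_6", "cue_7"]
def malignancy_score(ratings):
    # ratings: dict mapping each cue to a 1-7 rating (higher = more malignant, by assumption).
    return sum(ratings[cue] for cue in CUES) / len(CUES)
example = {cue: 4 for cue in CUES}
example["ulcer_size"] = 7
print(malignancy_score(example))   # about 4.43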
p.172
The researchers didn't have a computer. They transferred all of their data onto punch cards, which they mailed to UCLA, where the data was analyzed by the university's big computer. The researchers' goal was to see if they could create an algorithm that would mimic the decision making of doctors.
p.173
But then UCLA sent back the analyzed data, and the story became unsettling. (Goldberg described the results as “generally terrifying”.) In the first place, the simple model that the researchers had created as their starting point for understanding how doctors rendered their diagnoses proved to be extremely good at predicting the doctors' diagnoses.
p.173
The doctors might want to believe that their thought processes were subtle and complicated, but a simple model captured these perfectly well.
p.173
More surprisingly, the doctors' diagnoses were all over the map: The experts didn't agree with each other. Even more surprisingly, when presented with duplicates of the same ulcer, every doctor had contradicted himself and rendered more than one diagnosis: These doctors apparently could not even agree with themselves.
p.173
Experience appeared to be of little value in judging, say, whether a person was at risk of committing suicide. Or, as Goldberg put it, “Accuracy on this task was not associated with the amount of professional experience of the judge.”
p.174
Still, Goldberg was slow to blame the doctors.
p.174
How could their simple model be better at, say, diagnosing cancer than a doctor? The model had been created, in effect, by the doctors. The doctors had given the researchers all the information in it.
p.174
The Oregon researchers went and tested the hypothesis anyway. It turned out to be true. If you wanted to know whether you had [stomach ulcer] cancer or not, you were better off using the algorithm that the researchers had created than you were asking the radiologist to study the X-ray. The simple algorithm had outperformed not merely the group of doctors; it had outperformed even the single best doctor.
p.17
You could best the doctor by replacing him with an equation created by people who knew nothing about medicine and had simply asked a few questions of doctors.
p.174
Lew Goldberg, “Man versus Model of Man”
p.175
It was as if the doctors had a theory of how much weight to assign to any given trait of any given ulcer. The model captured their theory of how to best diagnose an ulcer. But in practice they did not abide by their own ideas of how to best diagnose an ulcer. As a result, they were beaten by their own model.
p.175
Why would the judgement of an expert--a medical doctor, no less--be inferior to a model crafted from that very expert's own knowledge?
(Michael Lewis, The undoing project, 2017, )
____________________________________
Clayton M. Christensen, The innovator's prescription, 2009
p.391
The Joint Commission on Accreditation of Healthcare Organizations also weighed in to require teleradiology services to meet licensing and accreditation standards that have long been in place for hospital-based solution shops of radiologists.46 The result: a typical NightHawk radiologist has licenses in 38 states and is credentialed at over 400 hospitals. The company employs 35 to 40 people simply to manage all of this administrative overhead--and yet can still provide these services at lower cost than most of its customers can when they choose to perform them in-house.47
However, a funny thing is happening at the edge of this stalemate. A growing segment of work is no longer dependent on a radiologist's expert eye and clinical experience to interpret shadowy anatomical structures and link them to patients' clinical histories and physical symptoms.48 “Functional” radiology, involving dynamic in-motion studies and molecular tracers rather than still pictures, and “quantitative” radiology--a related discipline based on measurements and scoring algorithms--have significantly enhanced the ability of nonradiologist physicians to elucidate physiologic abnormalities.49 Starting with basic technologies like ultrasound and fluoroscopy, these machines automate image acquisition and analysis, embedding into algorithms some of the diagnostic skill that used to reside only in the intuition of radiologists. These machines also require less space, shielding, and power, so they can be integrated into the offices of cardiologists and orthopedic surgeons working in value-adding process clinics.50
NOTES
48. Our thanks to Dr. Keith Batchelder and Peter Miller of Genomic Healthcare Strategies for suggesting these technological enablers of disruption in radiology.
( Christensen, Clayton M., 2009, The innovator's prescription : a disruptive solution for health care / by Clayton M. Christensen, Jerome H. Grossman, Jason Hwang., 1. Health services administration., 2. Public health administration., 3. Disruptive technologies., RA971.C56 2009, 362.1 Christen, )
____________________________________
Are Translator Devices Worth it in 2020? Testing it in Japan
https://www.youtube.com/watch?v=p6TF1iUi6fQ
13:12
Tokyo Lens
Jan 14, 2020
Today we are in Tokyo, Japan putting a translator device to the test and seeing if these kinds of devices are worth it in the modern day and age of 2020. With the Olympics hitting Tokyo this year, plenty of people who don't speak Japanese (or English) will be making their way to Japan, and it's time to see if a device like this can help.
Doing a full review of this translator device (which can be used with and without internet!)
THE DEVICE:
https://amzn.to/3adZmsf
This code should get you 10% off: lens10lw
I GET NO MONEY from the code or anything, but any purchase you make on amazon through clicking one of my amazon links, does give support to the channel~
____________________________________
https://www.amazon.com/gp/product/B07LF9XPJW/
Langogo Genesis Portable Language Translator Device, 100+ Languages Pocket Translator, Real-time Voice Translator with Offline Translation, Built-in Data, 3.1inch Retina Display Traductor, Black
⚡【Reliable Travel Buddy】Langogo enhances the travel experience, it helps you overcome cross-language barriers, always stay connected via its mobile hotspot and get local information like hotels to stay, attractions to visit, as well as weather forecast and so on while traveling overseas.
⚡【One-Button Accurate Translation】Langogo offers an online two-way translation in one second with a single button. Powered by 24 world-leading translation engines, it ensures translation of the highest accuracy for 104 languages, even against different accents.
⚡【Voice Recording and Transcription】Genesis records a single speech up to 4 hours and instantly shows the transcription on the screen so you can focus on meeting and interview. Enjoy free English transcription before 2021 and a 1-month trial for the others.
⚡【Mobile Wi-Fi Hotspot】Genesis is also a mobile hotspot device. With the built-in eSIM chip, it allows a purchase on the device for hotspot data plan to offer a Wi-Fi connection for up to 5 mobile devices, with no extra SIM card.
⚡【Enjoys Continuous Update】The self-learning algorithm and continuous updates improve its performance. More functions and up-to-date vocabulary are being added to it, and the more you use it, the more precise it becomes.
Langogo Genesis AI Translator with Wi-Fi Hotspot
Langogo Genesis is specially designed to help you to improve your travel experience. It integrates 24 translation engines with its one-button translation design to enhance the accuracy and convenience of the speech-to-speech translation. In addition, Langogo Genesis can be used as a mobile Wi-Fi hotspot, which keeps you connected to the internet while traveling abroad and saves your phone battery. You can then focus on sharing memorable moments along the way with all your loved ones.
To use Langogo with updated languages and latest functions, please always check your system version and upgrade before using it.
One-button Translation
Langogo offers a unique one-button two-way translation. It can automatically recognize the inter-translation language, which means when you say one language, Langogo translates your words to the other automatically. No A/B buttons, no extra App.
The translation process is based on 24 translation engines integrated and its self-learning algorithms, therefore Langogo translates with the highest accuracy and efficiency.
More than a translator, Langogo Genesis is also an intelligent voice assistant. It can deliver useful information including weather forecasts, exchange rates, nearby attractions and hotels, and so on. More powerful skills, such as navigation, travel guides, taxi booking, etc., will be available shortly.
Langogo supports lifetime update service, which continuously improves the stability, performance, and safety of Langogo. New languages and system functions will be constantly replenished and available for an online update on your Langogo.
Languages Translated and Countries Covered to Use eSIM
• Languages Translated Online: Arabic (Algeria), Arabic (Bahrain), Arabic (Egypt), Arabic (Iraq), Arabic (Jordan), Arabic (Kuwait), Arabic (Lebanon), Arabic (Morocco), Arabic (Oman), Arabic (Qatar), Arabic (Saudi Arabia), Arabic (State of Palestine), Arabic (Tunisia), Arabic (United Arab Emirates), Armenian (United States), Azerbaijani, Basa Sunda, Bulgarian, Catalan, Czech, Croatian, Chinese (Mandarin), Chinese (Cantonese), Chinese (Taiwan), Danish, Dutch, English (Australia), English (Canada), English (UK), English (Ghana), English (Ireland), English (India), English (Kenya), English (Nigeria), English (New Zealand), English (Philippines), English (Tanzania), English (United States), English (South Africa), Finnish, Filipino, French (Canada), French (France), Georgian, German (Germany), Greek, Gujarati (India), Hebrew, Hindi (India), Hungarian, Icelandic, Indonesian, Italian, Javanese (Indonesia), Japanese, Kannada, Korean, Lao, Latvian, Lithuanian, Malay, Nepal, Norwegian, Persian, Polish, Portuguese (Brazil), Portuguese (Portugal), Romanian, Russian, Serbian, Sinhalese (Sinhala), Slovak, Slovenian, Spanish (Argentina), Spanish (Bolivia), Spanish (Chile), Spanish (Colombia), Spanish (Costa Rica), Spanish (Dominican Republic), Spanish (Ecuador), Spanish (Spain), Spanish (Guatemala), Spanish (Honduras), Spanish (Mexico), Spanish (Nicaragua), Spanish (Panama), Spanish (Peru), Spanish (Puerto Rico), Spanish (Paraguay), Spanish (El Salvador), Spanish (United States), Spanish (Uruguay), Spanish (Venezuela), Swahili, Swedish, Tamil (India), Telugu (India), Thai, Turkish, Khmer (Cambodia), Ukrainian, Urdu, Vietnamese, Zulu
• Languages have only translation displayed in text: Azerbaijani, Persian, Gujarati (India), Armenian (United States), Icelandic, Georgian, Kannada, Lao, Lithuanian, Latvian, Serbian, Swahili, Urdu, Zulu
• Languages Translated Offline: Chinese, English, Japanese, Korean
• Countries and Regions Supporting eSIM for Translation: Albania, Australia, Austria, Bangladesh, Belarus, Belgium, Bulgaria, Cambodia, Canada, China Mainland, Croatia, Cyprus, Czech Republic, Denmark, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, Hong Kong, Hungary, Iceland, Indonesia, Ireland, Israel, Italy, Japan, Kazakhstan, Kyrgyz Republic, Laos, Latvia, Liechtenstein, Lithuania, Luxembourg, Macao, Macedonia, Malaysia, Malta, Netherlands, New Zealand, Norway, Oman, Philippines, Poland, Portugal, Qatar, Romania, Russia, Saudi Arabia, Serbia, Singapore, Slovakia, Slovenia, South Africa, South Korea, Spain, Sri Lanka, Sweden, Switzerland, Taiwan, Tajikistan, Thailand, Turkey, Ukraine, United Arab Emirates, United Kingdom, United States, Vietnam
• Countries and Regions Supporting eSIM for Hotspot Sharing: Argentina, Australia, Austria, Belarus, Belgium, Brazil, Bulgaria, Cambodia, Canada, Chile, China Mainland, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hong Kong, Hungary, Iceland, India, Indonesia, Ireland, Italy, Japan, Laos, Latvia, Liechtenstein, Lithuania, Luxembourg, Macao, Macedonia, Malaysia, Malta, Mexico, Netherlands, New Zealand, Norway, Peru, Philippines, Poland, Portugal, Romania, Russia, Serbia, Singapore, Slovakia, Slovenia, South Korea, Spain, Sweden, Switzerland, Taiwan, Thailand, Turkey, Ukraine, United Kingdom, United States
• System Language: Chinese(Traditional), Chinese(Simplified), English, French, German, Japanese, Korean, Spanish, Thai
____________________________________
··<────────────────────────────────────────────────────────────────────────────>