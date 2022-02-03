Tools like DALL-E 2, Stable Diffusion, and ChatGPT, which is getting a lot of attention these days, are very impressive. The first two create images from a text description; the third is a conversational agent that can answer almost any question or generate custom text. These technologies are so advanced that it is sometimes hard to believe their output is not the work of a human. However, as Melissa Heikkila explains in MIT Technology Review, such an abundance of “artificial” text may be more problematic than it seems.

ChatGPT is like an encyclopedia that is available 24 hours a day and answers (almost) every question in record time. Mathematics, history, philosophy: nothing escapes it. But where this conversational agent – which is based on OpenAI’s GPT-3 language model – particularly stands out is in text generation. Whether it is a fictional story, an email, a joke, or a newspaper article, it can produce clear, understandable, and credible writing on any topic. Less than a month after its launch, more than a million people had already used it.

While this capability could let students write essays effortlessly, it can also have much more serious implications. Melissa Heikkila mentions health advice that no real health professional has vetted, as well as other important informational content. “AI systems can also foolishly contribute to the creation of a lot of misinformation, abuse and spam by distorting the information we consume and even our sense of reality,” she writes.

Some tools exist for detecting AI-generated text, but they are ineffective against ChatGPT, the journalist says. Today, the greatest concern is not so much that it is impossible to determine whether a text is human or artificial in origin, but above all that the Web could very quickly fill up with largely incorrect content. Why? Because AIs learn from content they collect from the internet, some of which was itself created by other AIs!

Language models are initially trained on datasets (text and images) found on the Internet. These range from good content to misleading and malicious content posted by some people. An AI trained on this data in turn produces false content that is shared across the web, where it is picked up to train other, even more convincing language models, which humans can then use to create and spread further false information, and so on.
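The loop described above can be illustrated with a deliberately simplified toy model. This is only a sketch under loudly stated assumptions: the function name `simulate_error_drift` and all three parameters (`initial_error`, `added_error`, `ai_share`) are hypothetical, the numbers are illustrative rather than measured, and real training pipelines are far more complex. It assumes that each new model reproduces the error rate of the data it was trained on plus a small extra amount, and that its output then makes up a fixed share of the next training corpus.

```python
def simulate_error_drift(initial_error, added_error, ai_share, rounds):
    """Toy model of the feedback loop (all parameters are hypothetical).

    initial_error: fraction of incorrect content in the starting corpus
    added_error:   extra error a model adds on top of its training data
    ai_share:      share of the next corpus that is AI-generated
    rounds:        number of train-then-publish cycles to simulate
    Returns the corpus error fraction after each round.
    """
    corpus_error = initial_error
    history = []
    for _ in range(rounds):
        # The model inherits the corpus error and adds some of its own.
        model_error = min(1.0, corpus_error + added_error)
        # Its output is mixed back into the corpus used for the next round.
        corpus_error = (1 - ai_share) * corpus_error + ai_share * model_error
        history.append(corpus_error)
    return history


if __name__ == "__main__":
    for step, err in enumerate(simulate_error_drift(0.10, 0.05, 0.5, 5), start=1):
        print(f"round {step}: {err:.3f} of the corpus is incorrect")
```

Under these assumptions the error fraction can only grow from round to round, which is the point of the argument above; how fast it grows in reality depends on quantities this sketch does not attempt to estimate.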

Now this phenomenon extends to images. “The internet is now forever polluted with AI-generated images. The images we took in 2022 will be part of every model we create from now on,” says Mike Cook, an AI researcher at King’s College London.

From all this, we can conclude that it will become increasingly difficult to find good data that was not generated by AI for training future AI models. “It’s very important to ask if we need to train on the entire web or if there are ways to filter out high-quality material that will give us the language model we want,” Daphne Ippolito, senior fellow at Google Brain, Google’s deep learning research arm, said in an interview with MIT Technology Review.

How to detect text generated by artificial intelligence?

It is therefore becoming necessary to develop tools for detecting AI-generated text, not only to ensure the quality of future language models, but also to ensure that the information we access every day is grounded in truth. As Melissa Heikkila points out, people might try to submit AI-generated scientific papers for peer review or use the technology as a tool for disinformation, which would be especially harmful during elections, for example.

Humans also have a role to play in this fight against artificial content: they must become more savvy and learn to recognize non-human content. “Humans are sloppy writers”: text written by a real person will contain typos or spelling errors, a few slang words, and sometimes confusing turns of phrase. All of these are small signs that AI cannot reproduce (at least not yet). Moreover, language models work by predicting the next word in a sentence, so they tend to favor the most common words and use very few rare ones.
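As a rough illustration of the last point, one could measure what share of a text consists of very common words. The sketch below is not a real detector and is not the method used by any tool mentioned in the article: the `COMMON_WORDS` list is a tiny hypothetical stand-in for a proper word-frequency list, and `common_word_share` is a name introduced here for illustration. A high share would at most be a weak hint, never proof, that a text was machine-generated.

```python
import re

# Hypothetical, deliberately tiny list of very common English words.
# A serious check would use a large frequency list built from a reference corpus.
COMMON_WORDS = frozenset({
    "the", "of", "and", "a", "to", "in", "is", "it", "that", "for",
    "on", "with", "as", "was", "are", "be", "this", "by", "or", "at",
})


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def common_word_share(text, common_words=COMMON_WORDS):
    """Return the fraction of tokens that appear in the common-word list."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in common_words)
    return hits / len(words)


if __name__ == "__main__":
    sample = "The cat and the dog sat on the mat."
    print(f"{common_word_share(sample):.2f} of the tokens are very common words")
```

Comparing this share across texts of similar length and topic is more meaningful than reading a single value in isolation, since the baseline depends heavily on genre and vocabulary.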

So much for form. When it comes to content, it is also important simply to keep a critical distance from what you read online. It is worth noting, for example, that ChatGPT’s training phase ended in 2021, so the tool relies on data that was online up to that year. Answers requiring knowledge from after that date are therefore bound to be wrong, outdated, or made up.