Synthetic Data: Illusions, Lies, Deceptions and the Reality

January 18, 2024

It will be supremely tough to differentiate between reality and deception in an AI-dominated world that uses synthetic data.

Data is synonymous with the word ‘information’, and in the 21st century it is the most valuable asset. With every passing moment, more and more data is pumped into society. A common mistake is to link the word ‘data’ only with computers or digital devices. That link is too narrow. Every book written and every picture painted forms a part of the ‘data’ circulating through society. The earliest recorded data can be traced back to cave paintings, stone sculptures, and so on.

Human beings have the innate tendency to try to replicate whatever they see or perceive. For example, the cavemen drew the animals they hunted, the storms that occurred, and so on. In the digital age, it is much easier to capture pictures of what we see, and record videos of what we experience. Apart from just what we perceive, what we do also forms a major part of the data in circulation — for example, every bank transaction we carry out, every location we visit, and so on. And this data is not mere information; it is a valuable part and parcel of everyone’s life, and defines who they are. A computer system is able to perfectly predict one’s behavioural patterns with this data. This is nothing new. All of us have witnessed random ads on websites about the exact thing we are thinking about. There are even jokes around this, saying a computer is able to read minds, but in reality, it is able to perfectly predict your behaviour.

I have used Google as my default search engine since I first touched a computer, and an Android device for over seven years, so my entire digital history can be summed up on a single web page within my Google account. This is true not just for me but for millions or even billions around the globe. You can try this yourself and explore what a single organisation, Google, knows about your life: just visit myactivity.google.com. And if one organisation has so much information about you, imagine how much more is out there once we count the information held by banks, travel agencies, e-commerce websites, and so on.

So far, this has been about information on what we do and what we see. But a significant amount of data is about what does not exist: the imaginary and the illusory. As discussed before, even this illusory data is not limited to the computer realm. Every fairy tale ever written, every lie ever told, and every image drawn from one’s imagination forms a part of the data about things that do not actually exist.

However, it is to be noted that even if data is not true, it doesn’t mean that it is not useful. Every child picks up their first book of alphabets, and the first word on the first page is usually ‘A for Apple’. But the picture beside that is generally a hand-drawn picture of an apple coloured uniformly red. As adults, we would agree that a real apple does not look like that picture. But that picture will help a child identify an apple when he or she finds one in real life. However, if a child is presented with a green apple, he or she may mistake it for a guava or a pear. This is because the pictures of apples are usually red, and the pictures of guavas or pears are usually green. So what can be done to avoid this situation? The solution is to introduce more pictures of apples to the child — both red and green. And the child’s mind then starts to identify the fruit by its shape and not the colour.

Figure 1: Children’s book showing letter ‘A’

This is similar to what is done when building a machine learning model. A model needs to be trained with a vast amount of data to be able to identify objects properly. For common items, these ‘vast amounts of data’ are available as ready-made data sets for model training. One of the most widely used data sets for object recognition is COCO (Common Objects in Context), which contains over 330,000 images.
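The red-apple problem can be reproduced with a toy classifier. The following is only a minimal sketch: the two features (redness and roundness), their values, and the `nearest_label` helper are made up for illustration, and a simple nearest-neighbour rule stands in for a real object recognition model.

```python
import math

def nearest_label(examples, query):
    """Return the label of the training example closest to the query
    (a 1-nearest-neighbour rule)."""
    closest = min(examples, key=lambda ex: math.dist(ex[0], query))
    return closest[1]

# Hypothetical features: (redness, roundness), each on a 0-to-1 scale.
# The "picture book" only contains red apples and green pears.
picture_book = [
    ((0.90, 0.90), "apple"),
    ((0.95, 0.85), "apple"),
    ((0.10, 0.40), "pear"),
    ((0.15, 0.45), "pear"),
]

green_apple = (0.15, 0.88)  # green, but round like an apple

# With only red apples seen, colour dominates and the guess is wrong.
before = nearest_label(picture_book, green_apple)

# Add a few green apples to the training data and try again.
more_pictures = picture_book + [
    ((0.10, 0.90), "apple"),
    ((0.20, 0.92), "apple"),
]
after = nearest_label(more_pictures, green_apple)

print(before, after)  # pear apple
```

With only red apples in the training data, the green apple is labelled a pear. Once green apples are added, roundness separates the two fruits and the label is correct.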

But what is to be done when we are trying to identify rare objects, about which not much data exists? Let us go back to the analogy of a child. How would a child identify a dragon or a unicorn if he or she ever came across one? There is obviously no real photograph of a dragon or a unicorn that the child could have seen. But he or she has gone through drawings and artists’ impressions, which are representations drawn from imagination. It is data about something that does not exist. The data is entirely made up to help the child identify the dragon even though it never existed. Similarly, even for computers, in such cases, the data has to be made up. Made-up data is also known as ‘synthetic data’ and plays a very important part in the training of machine learning models.

But this is just about object recognition. Machine learning models can be used for many more purposes. For example, a model can be built to identify fraudulent transactions from bank statements. However, there may not be enough records of fraudulent transactions available to train the model. In such scenarios, synthetic data has to be prepared.
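One simple way to prepare such data is to interpolate between the few real records that do exist, which is the idea behind SMOTE-style oversampling. The sketch below is an illustration under stated assumptions, not a production method: the feature layout, the sample values and the `synthesize_by_interpolation` helper are all hypothetical, and it uses only Python's standard library.

```python
import random

def synthesize_by_interpolation(records, n_new, seed=0):
    """Create n_new synthetic records, each a random point on the line
    segment between two randomly chosen real records."""
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(records, 2)  # two distinct real records
        t = rng.random()               # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical fraud records:
# (amount, hour_of_day, transactions_in_last_hour)
real_fraud = [
    (9800.0, 2.0, 7.0),
    (12500.0, 3.0, 9.0),
    (8700.0, 1.0, 6.0),
    (15000.0, 4.0, 11.0),
]

synthetic_fraud = synthesize_by_interpolation(real_fraud, n_new=20)
print(len(synthetic_fraud), "synthetic records, for example", synthetic_fraud[0])
```

Every synthetic record stays within the range of the real ones, so it looks plausible, but it also cannot show the model a kind of fraud that the real records never contained.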

A very different type of artificial neural network, called a generative adversarial network (GAN), is used to generate synthetic data. But what are the moral obligations around the generation and use of synthetic data? That boils down to the fact that synthetic data is not real data and does not have a physical significance. For example, let’s look at this question: how safe is lying? Well, telling a child that a dragon lives in the mountains may be safe, but claiming on an insurance form that someone was killed by a dragon is illegal; it is insurance fraud. It is the same lie, but the situation in which it is used determines whether it is safe and legal.

As we approach the end of 2023, recent months have witnessed a substantial increase in the capabilities of AI. Generative AI now crafts stories, draws pictures, and performs various tasks with heightened proficiency. GANs have been around for a long time; what has changed is that generative AI tools have made such techniques accessible to everyone. This has led to a rise in cybercrime and fraud. For example, frauds over voice calls sound more realistic now because a scammer can use AI to make his or her voice sound like that of someone the victim knows. Ever since Photoshop added the Generative Fill feature a few months ago, it has been a running joke that people could fake being sick by photoshopping themselves into a hospital bed. This shows how easy it is to tamper with evidence these days.

A prime example is deepfakes, which can swap one person’s face for another’s so accurately that the result often escapes detection by the naked eye. A recent example is the deepfake of a popular south Indian actress doing the rounds on social media, which was made from a video of a social media influencer. Deepfakes have been used by criminals in many cyber frauds. There has also been a rise in cases where deepfakes are used to create indecent pictures or videos to harm someone’s reputation.

With the amount of synthetic data being pumped into society every single day, it is quite difficult to differentiate between reality, illusion, and outright deception. As children, we trusted what we were told, and believed in dragons, unicorns, and fairies. But we soon outgrew that. As adults, we may now need to sharpen our powers of discernment substantially to figure out what’s real and what’s not.
