AI has three cornerstones: computing power, data, and algorithms.
Among the three, computing power has been the most prized: the market value of Nvidia, the industry's "shovel seller", at one point surpassed Microsoft and Apple to make it the world's most valuable company. However, as Scale AI founder Alex Wang emphasized in a podcast, data is replacing computing power as the biggest bottleneck to improving AI model performance.
AI's thirst for data is endless, but accessible Internet data resources are almost exhausted. To further improve model performance, models must rely on more high-quality data. Enterprises hold large amounts of valuable data internally, but this unstructured data can only be used for AI training after it has been carefully labeled. Data labeling is resource-intensive and has long been regarded as the most laborious and least glamorous part of the AI industry chain.
Yet it is precisely its first-mover strategy in data annotation that earned Scale AI a $13.8 billion valuation in its latest funding round this May, surpassing many well-known large-model companies. That achievement undoubtedly shatters the prejudice that "data annotation is just grunt work".
Just as many decentralized computing projects have challenged Nvidia, the crypto AI project Sapien AI, which closed a $5 million seed round this April, is also taking on Scale AI. It not only wants to carve into the long-tail market through decentralization, but also plans to build the world's largest human data annotation network.
Recently, BlockBeats interviewed Trevor Koverko, co-founder and COO of Sapien AI. As co-founder of several successful projects such as Polymath, Polymesh, and Tokens.com, Trevor had accumulated rich entrepreneurial experience before founding Sapien AI. In the interview, he shared the story of founding Sapien AI, his insights on how Sapien AI competes with Scale AI in different ways, and how he drew inspiration from blockchain games when designing business mechanisms.

Sapien AI project experience site: game.sapien.io

BlockBeats: I saw from your LinkedIn that you played for the NHL's New York Rangers. As a former professional ice hockey player, how did you transition into the crypto industry?

Trevor: I have tried many different roles in my career. Ice hockey was my first job. In Canada, ice hockey is a central part of our culture; if you didn't play as a child, you would almost be regarded as an alien. So it was an important part of my upbringing, and I learned a lot from teamwork and high-level competition that still shapes me today.

When my hockey career ended, I turned to business and spent some time in Asia. I lived in China, in Dalian, a city in the northeast. My sports career and my time in China were two very important parts of my growth.

I also grew up in the crypto ecosystem in Toronto. I got involved in the Bitcoin community very early, before Ethereum launched. We often went to meetups, talked with friends, and met Vitalik, who at the time was just an editor at Bitcoin Magazine. Later, when Vitalik published the white paper, the Bitcoin community gradually evolved into the Ethereum community. It was a passionate time.

I launched my own RWA project, Polymath, in 2017-2018, when the field did not even have a clear category; we called it "security tokens". It was my first major project in crypto.
We did everything on that project, from raising funds to launching applications on Ethereum. Eventually we also built our own Layer 1 blockchain, which was a bigger challenge. Fortunately, we had very smart people like Charles Hoskinson as protocol architects. Today, that blockchain has evolved into an independent brand called Polymesh. It is one of the earliest and largest RWA networks at the Layer 1 level. Now I am just a community member; since it is completely decentralized, I simply support the network from a distance. In terms of adoption it has performed very well, and RWA is gradually becoming an exciting ecosystem.

BlockBeats: What made you turn your interest from RWA to AI and decide to start Sapien AI?

Trevor: After Polymesh's day-to-day operations were decentralized, I became interested in AI. Toronto has a very strong AI technology community, and many of the early architectures of modern AI were created by researchers at the University of Toronto, such as Geoffrey Hinton, the "father of deep learning", and Ilya Sutskever, former chief scientist of OpenAI. I was very interested in using AI, and I had a group of smart friends working on machine learning at the University of Waterloo. I gradually became interested in the AI technology stack: how it works, how training data is produced, and how humans participate in producing that training data. It was a very natural learning process. I didn't set out to start a company, but after about six months of diving into AI and machine learning, guided by a mentor in the machine learning graduate program at the University of Waterloo, we began to identify interesting problem areas and opportunities to solve them. Eventually, we founded Sapien.

BlockBeats: Can you introduce the core mission of Sapien AI for those who don't know it? Why are data annotation services important in the current AI industry?
Trevor: Data annotation is extremely important. It is one of the main reasons for the success of mainstream large language models such as ChatGPT, because they were the first models to use human data annotators at industrial scale to enrich their datasets. Today, data annotation is becoming even more important: performance competition between these models is fierce, and the best way to improve a model is to add more professional human annotations to the dataset.

Toronto, a fertile ground for innovation where the crypto and AI communities converge

Left: Ilya Sutskever; Right: Geoffrey Hinton
We think of data processing as a supply chain: first there is raw data; then it must be structured and organized; once structured, it can be used for training; once trained, the model can serve inference. In short, it is a process of incrementally adding value to data in the context of AI.
Just like other industries, we are starting to see segmentation in the AI industry, with different verticals emerging and certain companies excelling at specific steps of the process. For me, the second step, structuring and preparing data for training, has always been the most interesting part.
BlockBeats: What makes Sapien AI different from traditional Web2 companies like Scale AI?
Trevor: That's a great question. We admire Scale; they are an amazing company with amazing co-founders, one of whom we know. They are one of the largest AI companies in the world, whether measured by revenue, market cap, or usage.
What’s different about us is that we think from first principles about what a modern data annotation stack should look like in 2024. We’re not necessarily going after the use cases that Scale covers, we’re targeting the mid- and long-tail markets.
We strive to make it easy for anyone to get human feedback on their dataset, whether they are working on an open-source model for the mid-market, an enterprise-level model, or just doing individual research on the weekend. If you want to improve model performance and need human feedback on demand, come to us.
You can think of us as a more distributed or decentralized version of Scale AI. Our annotators are more widespread and are not tied to a specific location; they can work remotely from anywhere. In a way, this decentralization lets us deliver better annotation quality, because diversity is not an end in itself: it directly improves the quality of the training data.
For example, if a group of people with similar backgrounds annotates data in one facility, the output is likely to carry cultural bias. So we strive to make the annotator pool as diverse and robust as possible from the beginning. Decentralization also gives us access to higher-quality annotators: if people have to work at a specific location in the Philippines, the talent you can attract is limited, but with a remote-first approach we can find annotators anywhere.
I'm not saying Scale isn't doing these things, but we're thinking about how to serve other parts of the model market. Because we think this market is going to grow, and there's going to be a lot of private and permissioned models that need human feedback.
BlockBeats: How is the data annotation workflow designed and optimized at Sapien AI? What are the key steps to ensure data quality?
Trevor: Our platform works like a two-sided marketplace. You can think of it as a decentralized Uber for data annotation.
On one side is the demand side, Uber's riders; for us, these are enterprise customers who need human feedback on their models. For example, they are building a large language model and want to fine-tune it, and that is when humans need to be involved.
They come to us and upload their raw datasets to the network. We give quotes based on a few different variables of the dataset, like complexity, data modality, data format, etc. For enterprise customers, the process is very self-service.
The other side is the supply side, the annotators, who are our equivalent of Uber drivers. Right now, this is actually the bottleneck of the industry, and we need as many annotators as possible to join the network. Because the demand is basically unlimited, just like Uber, there are always people who want to take a ride, and this demand will never end. In the AI field, the demand for these AI models to consume more data is also constant.
We focus a lot on the supply side and are committed to making data annotation easy for anyone. We have invented some new techniques, and keep refining them, to ensure high-quality annotation at scale in a distributed setting. The question we asked at the outset was: can we ensure high-quality annotation without centralized management? This is what we call the "data annotation trilemma": can we simultaneously lower costs for customers, raise income for annotators, and improve overall quality?
We have run many experiments in this area with some very interesting results. We have tried new mechanisms like regression to the mean and anomaly detection, mixed with probabilistic models that can largely infer the quality of an annotator's work, and we are still working on newer techniques. So far, we are very excited about where data annotation is heading over the next five to ten years. We believe it will become more decentralized, more self-service, and more automated.
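As an illustration only (this is a hypothetical sketch, not Sapien AI's actual algorithm), one simple way to infer annotator quality without centralized review is consensus scoring: have several annotators label the same items, take the majority answer per item, score each annotator by agreement with the majority, and flag statistical outliers for review.

```python
from collections import Counter

def majority_labels(labels_by_item):
    """Majority vote per item; labels_by_item maps item -> {annotator: label}."""
    return {item: Counter(votes.values()).most_common(1)[0][0]
            for item, votes in labels_by_item.items()}

def annotator_scores(labels_by_item):
    """Fraction of items on which each annotator agrees with the consensus."""
    consensus = majority_labels(labels_by_item)
    agree, total = Counter(), Counter()
    for item, votes in labels_by_item.items():
        for annotator, label in votes.items():
            total[annotator] += 1
            agree[annotator] += (label == consensus[item])
    return {a: agree[a] / total[a] for a in total}

def flag_outliers(scores, z=2.0):
    """Flag annotators scoring more than z standard deviations below the mean."""
    vals = list(scores.values())
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
    return [a for a, s in scores.items() if std > 0 and s < mean - z * std]
```

Real systems would weight votes by annotator reputation and handle ties and sparse overlap, but the core idea is the same: agreement with peers is a cheap, decentralized proxy for quality.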
BlockBeats: Can you tell us more about your products and technologies, especially those that ensure data quality? I know you have a staking mechanism to prevent annotators from doing bad things, are there other technologies?
Trevor: Yes, we are trying many different approaches. We have a reputation system and a staking-and-penalty mechanism: annotators stake a certain amount of funds and can be fined if they fail to meet quality standards. These mechanisms are still in early experimentation, but we have found that this incentive alone can improve quality compliance significantly, by multiple standard deviations. The overall quality control is a weighted average of different algorithms, which we are constantly fine-tuning, and we also use machine learning ourselves to optimize the process. For example, we use ML linter tools and "Red Rabbit" tests, in which we plant decoy data with known answers among annotators' tasks to test whether they are annotating honestly.
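The staking and decoy-task ideas above can be combined into a simple sketch. This is a hypothetical illustration under assumed parameters (stake amount, slash rate, pass threshold), not Sapien AI's real mechanism: an annotator posts a stake, known-answer tasks are checked against ground truth, and a periodic audit slashes part of the stake if decoy-task accuracy falls below a threshold.

```python
class AnnotatorAccount:
    """Hypothetical stake-plus-decoy-task quality control (illustrative only)."""

    def __init__(self, stake: float, slash_rate: float = 0.1,
                 pass_threshold: float = 0.8):
        self.stake = stake
        self.slash_rate = slash_rate          # fraction of stake lost per failed audit
        self.pass_threshold = pass_threshold  # required accuracy on decoy tasks
        self.decoy_results = []               # True/False per known-answer task

    def record_decoy(self, submitted, expected):
        """Decoy ("known-answer") task: compare the submission to ground truth."""
        self.decoy_results.append(submitted == expected)

    def audit(self):
        """Slash the stake if decoy-task accuracy is below the threshold."""
        if self.decoy_results:
            accuracy = sum(self.decoy_results) / len(self.decoy_results)
            if accuracy < self.pass_threshold:
                self.stake *= (1 - self.slash_rate)
            self.decoy_results.clear()
        return self.stake
```

The point of the design is that cheating has a direct financial cost: an annotator who answers decoys carelessly loses stake faster than they earn, so honest work becomes the selfish strategy.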
This is a big question: how do you know whether people are Sybil-attacking the network, i.e. trying to cheat and manipulate the system? We have to stay vigilant about this at all times. It is also why we like Web3 incentive mechanisms: they were originally invented to solve exactly these problems, Sybil attacks and the Byzantine Generals Problem, by making it in everyone's best interest to follow the rules. Even a purely selfish actor is best served by following the network protocol.
We are still in the early stages. For some large customers, we have implemented more traditional quality control methods, and we are also moving quickly into this new frontier data world.
BlockBeats: What do you think is the biggest advantage of Sapien AI as a decentralized data annotation platform?
Trevor: As I said, our platform is more self-service, which allows us to serve a wider customer base. We also have very broad requirements for annotators. We want anyone to be an annotator because we believe the next era or next chapter of AI will be about extracting more existing knowledge from humans. Not just basic stuff like "this is a stop sign" or "this is a car" that both humans and machines can easily recognize, but more about reasoning.
Alex Wang of Scale talked about this: The data on the Internet is the result of reasoning, but it doesn't really describe the reasoning process. So how do we understand people's minds more deeply? This requires more work and more professional annotations. This has the potential to help us accelerate the development of general artificial intelligence (AGI).
So our larger mission is: Can we unlock more knowledge in private datasets inside of enterprises, in the minds of professionals who have expertise in certain verticals, like healthcare or law, that the models haven't yet captured?
We're still working on making our platform as liquid as possible, trying to keep supply and demand in balance. We want to have dynamic pricing, like Uber. These mechanisms make us more like a true two-sided marketplace, where we meet data demand and help annotators join. These are some of the unique ways we built our platform. For quality assurance, we use the techniques I mentioned earlier in real time. We want our annotators to get as much real-time feedback as possible because it creates a better experience for everyone.
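The Uber-style dynamic pricing mentioned above can be sketched as a surge multiplier driven by the ratio of open tasks (demand) to active annotators (supply). The function below is an illustrative assumption, with made-up parameters, not Sapien AI's actual pricing formula:

```python
def dynamic_price(base_rate: float, open_tasks: int, active_annotators: int,
                  max_surge: float = 3.0) -> float:
    """Illustrative surge pricing: scale the per-task rate by the
    demand/supply ratio, never below the base rate, capped at max_surge."""
    if active_annotators == 0:
        return base_rate * max_surge  # no supply at all: maximum surge
    ratio = open_tasks / active_annotators
    surge = min(max(ratio, 1.0), max_surge)
    return base_rate * surge
```

When supply exceeds demand the rate stays at the base; when tasks pile up, the per-task payout rises, pulling more annotators online, which is exactly the balancing behavior a two-sided marketplace needs.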
BlockBeats: I noticed that Sapien AI has partnered with the game guild Yield Guild Games (YGG). Can Sapien AI's decentralized labeling mechanism be understood as a "label to earn" game?
Trevor: Exactly. We want to reach people who want to make a living through their phones; we think this is the future of the gig economy. You don't need a car as with Uber, or to be in a physical location as with food delivery. You just log in on your phone, label data, and earn income.
YGG is an amazing partner, they are one of our angel investors. We have a great relationship with founder Gabby, and they have an amazing community in Southeast Asia. We have big plans with them to help their users find new ways to make money, while they help us acquire new users. We recently announced a few partnerships, and there are more plans in the works for the future. We will also be in Asia for most of Q4, meeting with these partners and continuing to drive collaboration.
BlockBeats: What do you think of "play to earn" blockchain games like Axie Infinity?
Trevor: It's very innovative and a source of inspiration. Although it was just an experiment, I believe it will come back in a new form. That's the beauty of startups and decentralized entrepreneurship; it's a kind of creative destruction.
There are definitely some "play to earn" elements to what we're doing, and we also tend to use the phrases "label to earn" or "train to earn." But there's a difference because we're a real business. There's real data being labeled, there's real customers paying real money, and ultimately there's a real product being produced. So it's not just an endless loop of a video game.
While labeling data with Sapien AI is fun, it's probably not as fun as playing Grand Theft Auto V. We want to strike a good balance between fun and useful, so that it's something you can do while you're waiting at the bus stop for 5 minutes, or you can do while you're at home in front of your computer for 5 hours. Our goal is to make it as accessible as possible.
BlockBeats: Are there ways you can make data labeling more fun, more like a game than a job?
Trevor: Yeah, we're experimenting with a lot of things right now. You can try it at game.sapien.io, where you become an AI worker and earn points by annotating real AI data while playing. The game is very simple, with an intuitive interface.
game.sapien.io game interface
The data itself is also interesting. You might annotate some very interesting images, such as our fashion data. Over time we plan to support many different modalities and datasets, and to keep adding features.
BlockBeats: In addition to YGG, what other crypto projects do you plan to work with in the future?
Trevor: We have some interesting ideas, such as creating a data standard for data annotation. Right now this space is quite chaotic: every customer's needs are different, and we have to do custom integrations with each one because their data formats and data modalities differ.
So we are in the early stages of building this standard, working with others in the decentralized data space, and planning to release it as a public product. We did something similar when we were at Polymath, where we released ERC-1400, which is now one of the default standards for tokenization on Ethereum.
So we have some ideas about creating standards and plan to drive this forward with the team that has helped us in the past, as well as some industry partners. This will make decentralized AI more real, and it will also make it more interoperable, meaning data can flow between different steps more easily because no one person can do everything.
BlockBeats: When is the specific release date for Sapien AI mainnet and mobile apps?
Trevor: We don't have a specific release plan at this point. We are focusing on our core Web2 product-market fit right now. We are growing very well and now have annotators from 71 countries. Our demand-side revenue has nearly doubled every month this year.
We just want to continue to grow, learn more about our customers, and continue to serve them. We're open to a variety of different strategies and technologies over time.
BlockBeats: I saw that Base co-founder Rowan Stone has joined Sapien AI as Chief Business Development Officer. Which blockchain will Sapien AI be built on? Are there plans to issue a native token?
Trevor: These are very thoughtful questions, and I appreciate them. Rowan is great. He founded Base with Jesse Pollak, and Jesse is an absolute legend. Rowan has a wealth of experience; in building industrial-grade Web3 products, he is second to none in my opinion. He co-led "Onchain Summer", one of the most successful events I can remember.
He is helping us develop our go-to-market strategy in certain areas. But, like I just said, we are currently very focused on serving our existing customers, and that is our main focus. We have not made any commitments or decisions in terms of choosing any Layer 1 or otherwise. But in the future, we will continue to consider various possibilities.
BlockBeats: What are Sapien AI's plans or goals for the future? What milestones do you hope to achieve in the next few years?
Trevor: Our mission is to increase the number of human data annotators in the world by 100 times and make this network easily accessible to anyone. We want to build the largest network of human data annotators in the world. We think this will be a very valuable asset, so we want to build it and control it, but eventually open it up completely, so that anyone can access it permissionlessly.
If we can build the largest human data annotation network in the world, this will unlock a lot of potential AI capabilities because the more high-quality data we have, the more powerful AI will be and the more it will be available to everyone.
We want it to work for everyone, not just the large language model companies that can afford a network of millions of human annotators. Now, anyone can use this network. You can think of it as an "annotation as a service" platform.
BlockBeats: Finally, I would like to ask about your observations on the industry as a whole. What untapped potential do you think exists in the crypto AI field?
Trevor: I am very excited about this field, which is why we founded Sapien AI. There are good sides, and there are also sides we need to guard against.
The good side is that decentralized AI may be more autonomous, more democratic, more accessible, and more powerful. This means that AI agents can have their own native currency for transactions, which also means that you can have more privacy and know exactly what is included in the model through ZK technology.
On the side to guard against, we face a frightening scenario in which AI becomes increasingly centralized and only governments and a few large technology companies have access to powerful models. Open-source and decentralized AI is a defense against that.
For us, the focus is the data side: decentralizing the data. That doesn't mean you can't decentralize other parts of the AI stack, like compute and the algorithms themselves. Just as the Transformer was a breakthrough on the algorithm side, we have since seen more innovations, but there is always room for improvement.
Just because you can decentralize something doesn't mean you should; there has to be real value at the end of the day. But just like finance and other parts of the Web3 space, AI can definitely benefit from decentralization.
BlockBeats: What advice would you most like to give to entrepreneurs who want to get into the crypto AI space?
Trevor:I would recommend learning as much as you can and really understanding the tech stack and architecture. You don’t necessarily need to be a PhD in machine learning, but it’s important to understand how it works and do research. From there, you’ll gradually understand the problem more organically over time. That’s key.
If you don’t understand how it works, you can’t understand what the problem is. And if you don’t know what the problem is, you shouldn’t be an entrepreneur because an entrepreneur’s job is to solve problems.
So this is no different than any other startup, you should understand the space. You don’t have to be the world’s leading expert in the field, but you need to know it well enough to be able to understand the problems and then try to solve them.
Join the official BlockBeats community:
Telegram subscription group: https://t.me/theblockbeats
Telegram chat group: https://t.me/BlockBeats_App
Official Twitter account: https://twitter.com/BlockBeatsAsia