The $2.8 Billion AI Startup Taking On Nvidia
Armed with a newly raised $640 million, Groq CEO Jonathan Ross thinks his company can challenge one of the world’s most valuable companies with a chip purpose-built for AI from scratch.
Read the full story on Forbes: https://www.forbes.com/sites/richardnieva/2024/08/05/groq-funding-series-d-nvidia/
Subscribe to FORBES: https://www.youtube.com/user/Forbes?sub_confirmation=1
Fuel your success with Forbes. Gain unlimited access to premium journalism, including breaking news, groundbreaking in-depth reported stories, daily digests and more. Plus, members get a front-row seat at members-only events with leading thinkers and doers, access to premium video that can help you get ahead, an ad-light experience, early access to select products including NFT drops and more:
https://account.forbes.com/membership/?utm_source=youtube&utm_medium=display&utm_campaign=growth_non-sub_paid_subscribe_ytdescript
Stay Connected
Forbes newsletters: https://newsletters.editorial.forbes.com
Forbes on Facebook: http://fb.com/forbes
Forbes Video on Twitter: http://www.twitter.com/forbes
Forbes Video on Instagram: http://instagram.com/forbes
More From Forbes: http://forbes.com
Forbes covers the intersection of entrepreneurship, wealth, technology, business and lifestyle with a focus on people and success.
Transcript
00:00I'm here with Jonathan Ross, the CEO of AI chip startup Groq.
00:04Jonathan, thanks for joining us.
00:06Thanks for having me.
00:07So tell us, just at a very high level, what does Groq do?
00:11So we build the LPU.
00:13You've heard of the GPU, but the LPU is a language processing unit.
00:16And the difference is GPUs are built for highly parallel programs.
00:21Things where you can do a lot of tasks at the same time,
00:24but they're not sequential, they don't rely on each other.
00:27So LPUs are good, for example, with language,
00:30because you can't predict the 100th word until you've predicted the 99th.
00:35And so it's completely unique, but super fast.
00:39And typically, when we show people a demo for the first time,
00:44the response is, wow.
00:46So actually, our website URL, it's not www.groq.com.
00:51It's actually w-o-w.groq.com.
00:54Can you explain inference to a layman?
00:57And what does that mean for a regular person who's using AI?
01:02So every time you go to one of these chatbot websites,
01:05and you type in a query, and you hit Enter,
01:08the result that comes back is inference.
01:10And the difference between that and what you'll typically hear about training
01:15is that, let's use an analogy.
01:18If someone wants to become a cardiologist,
01:22or let's say a cardiac surgeon, you spend a lot of years in school
01:27learning how to do that.
01:28That's the training, just like the word sounds, right?
01:32But inference is sort of like performing the surgery,
01:35going in there and doing it.
01:36Now, training is expensive.
01:40But it's nowhere near as expensive as inference.
01:42And this is one of the things that catches almost everyone off guard.
01:45So I remember, at Google, we had actually
01:48trained the world's best speech recognition model.
01:51And we just couldn't afford to put it into production.
01:55So we actually built the TPU in order to get that speed up,
02:02the performance, the amount of compute needed to put it into production.
02:06And just think about it this way.
02:08When you're training a model, it scales with the number
02:10of AI researchers you have.
02:11There aren't too many of those in the world.
02:13When you're doing inference, it scales with the number of end users you have.
02:17And I think, as Sam Altman put it, inference is eye-wateringly expensive.
02:23So we're here to make it cheaper, faster, and more available for everyone.
02:27And so you've made your bet on faster inference.
02:31Why is that important?
02:32And how does it show up for a regular person?
02:36It's a little bit like asking, why do people like fast sports cars?
02:39They just do.
02:40So that's why we put our website up, so you could just go there and try it.
02:44And it's fun watching people try.
02:48The comments that they make, it just comes out so fast.
02:51There's this visceral feeling.
02:53And there were all these studies at Google about speed.
02:56There were studies where they would actually imperceptibly slow down search
03:02to the point where you do an A-B test.
03:04And you actually couldn't, as a human being, say this one was faster
03:07than that one.
03:08But the one that was faster, people used more of, a lot more.
03:11And so even when we can't consciously tell which one is faster,
03:16we use more of whatever's faster.
03:18And if you think about it, the last time that you opened up an app
03:23and it just responded really slowly, you're just sitting there,
03:26waiting for the answer.
03:27And your mind drifts.
03:30And then finally, you get the answer.
03:32And you've totally lost your train of thought.
03:36That's why human beings want answers quickly.
03:39It allows them to stay in flow, and they get a lot more done.
03:42And as AI continues to advance, what are some of the things that become possible
03:47because of faster inference?
03:49Well, one of the big ones is agentic use cases.
03:53So we're used to going to the chatbots now and typing in a query and getting an answer.
03:59But that's a single step.
04:02Instead, suppose that you want to book a ticket to, I don't know,
04:06where's a good place to vacation?
04:08Hawaii.
04:09Hawaii.
04:09Okay.
04:10So you want to book a trip to Hawaii.
04:12So you type in, book me a trip to Hawaii.
04:14Well, it then needs to ask, where in Hawaii?
04:18Do you want to sit on the beach?
04:18Do you want to do…
04:19So it has to ask a bunch of questions.
04:20Once it's got all the answers to that, it then has to go and figure out which airline to book,
04:27which hotel to get for you.
04:28Some of them are going to be full.
04:29Some of them won't.
04:30So think about all of the tasks that you have to do to accomplish something.
04:35Those are agentic workloads.
04:38And you can't actually solve it until you've done all of these iterations.
04:44And so the longer it takes to get an answer, the more that compounds.
04:48So we actually had one customer who built an agentic workload.
04:52This is someone who has over 1 billion users.
04:57And it was taking them four to five minutes to actually get a result.
05:01And when they switched to Groq, they actually got it down to 10 seconds.
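A minimal sketch of why those iterations compound (the step count and per-call latencies below are made-up numbers for illustration, not the customer's actual figures):

```python
def agent_latency(n_dependent_calls, seconds_per_call):
    # Each step needs the previous step's result, so the calls run in series
    # and their latencies add up instead of overlapping.
    return n_dependent_calls * seconds_per_call

# Made-up numbers: a booking workflow with 30 dependent model calls.
print(agent_latency(30, 9.0))   # 270.0 s -> on the order of the 4-5 minutes described
print(agent_latency(30, 0.33))  # ~9.9 s  -> on the order of the 10 seconds described
```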
05:04I think when people think about AI chips, the household name that comes to mind is NVIDIA.
05:12How are your chips different from NVIDIA's?
05:15Well, NVIDIA builds what are called GPUs, graphics processing units.
05:19We build what we call LPUs, or language processing units.
05:22GPUs are very good at parallel processing.
05:25So imagine that you wanted to complete some sort of task, like filing your taxes.
05:31Well, just imagine that you could give each page of that to someone else to fill in.
05:35That would be what a GPU does.
05:37But if you want to write a story, you need a coherent arc.
05:40You need to know the beginning, the end, and everything's going to depend on what else happens.
05:44For that, you probably need an LPU because it's sequential.
05:48You can't actually predict the 100th word until you've predicted the 99th.
05:52That sequential component is something that GPUs can't do, but LPUs like these are very good at.
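To make that parallel-versus-sequential distinction concrete, here is a minimal Python sketch (the page task and the `next_word` model call are stand-ins for illustration, not Groq or NVIDIA code): independent pages can be handed out all at once, but each new word has to wait for the one before it.

```python
from concurrent.futures import ThreadPoolExecutor

def fill_in(page):
    # Stand-in for any independent, self-contained task (one "tax page").
    return f"completed {page}"

def parallel_work(pages):
    # GPU-style workload: the pages don't depend on each other,
    # so they can all be processed at the same time.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fill_in, pages))

def generate_text(next_word, prompt, n_words):
    # Language-style workload: word t depends on words 0..t-1,
    # so each step has to wait for the previous one to finish.
    words = list(prompt)
    for _ in range(n_words):
        words.append(next_word(words))  # must see everything produced so far
    return words
```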
05:57And so there's NVIDIA, there's AMD, Intel.
06:00A lot of the legacy companies, and there are also other chip startups like SambaNova and Cerebras.
06:07How do you fit into that ecosystem, and how do you compete with those other players?
06:12Well, one of the main things that we do is we actually make our chips available through a service, GroqCloud.
06:18And if you go there, you can actually just try it out.
06:21It's super fast.
06:22And then we have an API that allows people to build their own applications on top of it.
06:27So we don't require that you buy these servers and put them in data centers yourself.
06:32We handle all that for you.
06:33It makes it super easy.
06:35In fact, in the last, I think, about 14 weeks, we've gone from fewer than seven developers to over 260,000 developers.
06:44And that's because we made it super easy.
06:46You don't have to do much work.
06:47You just go there.
06:48The API actually matches OpenAI's API.
06:52So your existing code already works.
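As a rough illustration of that drop-in compatibility: code written against the OpenAI client can typically be repointed at GroqCloud by changing only the base URL and API key. The endpoint URL and model name below are assumptions for illustration; check the GroqCloud docs for current values.

```python
from openai import OpenAI

# Same client library an existing OpenAI integration would already use;
# only the endpoint and key change. The URL and model name here are
# illustrative assumptions, not guaranteed current GroqCloud values.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_API_KEY",
)

response = client.chat.completions.create(
    model="llama3-8b-8192",  # example model identifier
    messages=[{"role": "user", "content": "Explain inference in one sentence."}],
)
print(response.choices[0].message.content)
```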
06:54So you've been at this for a while.
06:56You started in 2016.
07:00And then all of a sudden, November 2022, ChatGPT comes out, and the world kind of discovers generative AI.
07:09How has your business changed since then?
07:12Oh, it's totally transformed things for us.
07:15It's funny because we actually thought we were going to run out of money.
07:20We thought we were going to die.
07:21The thing is, we had built it a little too early.
07:25People didn't need high-performance inference until LLMs came out.
07:30So all of those image classification models and the like, they were running fast enough that it didn't matter.
07:38But because you have to string all those words together, each word that's computed, the amount of time that it takes compounds.
07:45So if you want to get 600 words out and it takes you 20 milliseconds each, that's 12 seconds.
07:52Just imagine if you went to Google and you typed a query and you hit enter, and it took 12 seconds to get an answer.
07:58That'd be unusable.
07:59So when these LLMs came out, it made it very easy for people to viscerally get a sense of how fast our hardware was, and it mattered.
08:07Every 100 milliseconds of latency improvement is roughly an 8% increase in engagement.
08:14But we didn't improve things by 100 milliseconds.
08:16We took it from 10 seconds down to one second, which is 90 compoundings of 8%.
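Taken literally, that claim works out as a quick back-of-envelope calculation (illustrative arithmetic, not a measured result):

```python
# 10 s -> 1 s is 9,000 ms of improvement, i.e. 90 steps of 100 ms.
steps = (10_000 - 1_000) // 100   # 90
# If each 100 ms step really compounded an 8% engagement lift:
compounded = 1.08 ** steps        # roughly a 1,000x factor
print(steps, round(compounded))
```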
08:22We actually spent the first six months at Groq working on the compiler before we ever started designing the chip.
08:27And as far as we know, we're the only people who've ever done that.
08:30Typically, chips are designed by hardware engineers and hardware architects.
08:36And so they start with the chip and then they figure out the software later.
08:40This was a little bit like having a driver design a car.
08:43And then it caused all sorts of headaches in terms of, how do we fit the engine into this weird thing?
08:47Because it wasn't what a mechanic would design.
08:50It wasn't what a hardware engineer would design.
08:52But it actually works much better for the end user.
08:54And when Llama 3 came out, we were actually able to get into production the same day that it was released, even though it had not targeted our architecture.
09:03So what do you think your biggest challenges are going forward?
09:06So the biggest challenge is deploying more hardware.
09:09As we were talking about, we have over 200 racks in production now, and we have to get to about 1,300 by the end of the year.
09:20So that's 200 to 1,300.
09:23And so everything that we're doing is about scaling.
09:26And if we're able to do that, those 1,300 racks will actually put us at the same amount of capacity that one of the largest hyperscalers had at the end of 2023.
09:38So that'll put us in the same league and at the same scale as a hyperscaler.
09:43So you guys are not the only Groq around.
09:47Elon Musk has a chatbot called Grok.
09:50Has that led to any confusion?
09:53A little bit, and I'll just put it this way.
09:55We call dibs.
09:56We own the trademark.
09:58We call dibs.
09:59You said something really interesting.
10:00Compute is the new oil.
10:01Can you go into that a little bit more?
10:03What does that mean?
10:05The way to think about it is every technological age is based on something, some sort of scarce resource.
10:12The industrial age was built on oil, coal, natural gas, now solar, wind, but energy.
10:21The information age started off with the printing press, and eventually we got to the internet and mobile.
10:29And so a question that I used to get when we were fundraising was, is AI going to be the next internet?
10:37Is it going to be the next mobile?
10:38My answer is absolutely not, because those are information age technologies.
10:43This is a generative age technology.
10:45It's different, because whereas information age technologies are about copying data with high fidelity and replicating it and distributing it,
10:53generative age technologies are about creating something new in the moment, in the context of the question that's being asked.
11:01Well, Jonathan, thank you so much for joining me.
11:03Well, thanks for having me.
11:04I appreciate it.