Dave, we’d love to push our inference needs to a client’s PC, but here we are streaming tokens from the foundation model builders. New insight: we dropped GPT-5 in and it adds about a minute to inference time. Customers like it because the results are better, and also because the extra minute gives the “feeling” of a higher-quality response. So not only are we genuinely better than before, there is a psychic benefit for our clients too.
Centralized inference makes sense now. But (1) I don’t think it will stay that way for a class of inference problems, and (2) there will always be inference that has to be performed in the cloud due to complexity, size, etc.
What is the implication of this trend for NVIDIA? Will there still be such high demand for GPUs two years from now?
Inference isn’t a zero-sum world. Think of it like Wi-Fi and cellular data: Wi-Fi bled off a lot of demand from cellular networks, but new demands for cellular data traffic replaced what Wi-Fi had taken. Similarly, simple inference tasks will move to devices like mobile phones, tablets, and sensors, while complex, large-scale inference will remain in the cloud, and that complex inference will absorb any excess capacity.
Would love to hear your thoughts on interpretability in the context of enterprise adoption for high-value use cases.