Developing and training machine learning models requires a powerful hardware setup to keep training times fast and efficient.
In this blog, we will discuss the recommended hardware requirements for machine learning, specifically focusing on the processor (CPU) and graphics card (GPU).
Our recommendations are generalizations drawn from typical workflows. Please note that this post focuses on Machine Learning/Deep Learning workstation hardware for model “training” rather than “inference.”
Best Processor for Machine Learning
The processor and motherboard define the platform that supports the GPU acceleration in most machine-learning applications. While the GPU is the driving force behind machine learning, the CPU also plays an important role in data analysis and preparation for training. The two recommended CPU platforms for machine learning are Intel Xeon W and AMD Threadripper Pro. These platforms offer excellent reliability, sufficient PCI-Express lanes for multiple video cards (GPUs), and high memory performance.
For non-GPU tasks, the number of cores will depend on the expected load. A minimum of 4 cores per GPU accelerator is recommended, while 16 cores is considered minimal for this type of workstation. The choice between Intel and AMD is largely a matter of software compatibility. For example, the Intel platform may be preferable if the workflow benefits from the tools in the Intel oneAPI AI Analytics Toolkit.
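The core-count guideline above can be written as a quick calculation. This is only a sketch: the function name is made up for illustration, and it assumes the two numbers combine as "4 cores per GPU, but never fewer than 16 in total."

```python
def minimum_cpu_cores(num_gpus: int, cores_per_gpu: int = 4, workstation_floor: int = 16) -> int:
    """Rough lower bound on CPU cores for a GPU workstation.

    Assumption: the per-GPU rule and the 16-core floor are combined by
    taking whichever is larger.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return max(workstation_floor, cores_per_gpu * num_gpus)


for gpus in (1, 2, 4, 8):
    print(f"{gpus} GPU(s): at least {minimum_cpu_cores(gpus)} cores")
```

Under this reading, the 16-core floor decides the answer for up to four GPUs, and the per-GPU rule only takes over beyond that.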
Best GPU for Machine Learning
Almost any NVIDIA graphics card will work, with newer and higher-end models generally offering better performance. Fortunately, most ML / AI applications with GPU acceleration work well with single precision (FP32).
In many cases, using Tensor cores (FP16) with mixed precision provides sufficient accuracy for deep learning model training and offers significant performance gains over the “standard” FP32. Most recent NVIDIA GPUs have this capability, except for the lower-end cards.
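To get a feel for what FP16 gives up compared to FP32, the sketch below round-trips a value through both formats using only Python's standard library. It does not run anything on a GPU and is not a training loop. The "loss scaling" step is a technique commonly paired with mixed precision that we are adding for illustration, and the scale factor of 65536 is an arbitrary example value.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack a Python float into the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def to_fp16(value: float) -> float:
    return round_trip(value, "<e")  # IEEE 754 half precision


def to_fp32(value: float) -> float:
    return round_trip(value, "<f")  # IEEE 754 single precision


tiny_gradient = 1e-8   # example value, smaller than FP16 can represent
loss_scale = 65536.0   # illustrative scale factor (assumption)

print("FP32:", to_fp32(tiny_gradient))   # survives as a small non-zero number
print("FP16:", to_fp16(tiny_gradient))   # underflows to zero

# Scaling the value up before converting keeps it inside FP16's range;
# dividing afterwards (in higher precision) recovers it approximately.
recovered = to_fp16(tiny_gradient * loss_scale) / loss_scale
print("FP16 with scaling:", recovered)
```

This is why mixed precision keeps some values in FP32: very small numbers vanish in FP16 unless they are rescaled first.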
Consumer graphics cards like NVIDIA’s GeForce RTX 4080 and 4090 provide good performance, but may not be suitable for systems with more than two GPUs due to cooling design and physical size.
Professional NVIDIA GPUs like the RTX A5000 and RTX A6000 are high quality, have more onboard memory, and work well in multi-GPU configurations. The RTX A6000, with its 48GB VRAM, is recommended for data with large feature sizes such as high-resolution images and 3D images.
How much GPU memory (VRAM) is needed depends on the “feature space” of the model training. Memory capacity on GPUs has been limited, and ML models and frameworks have been constrained by available VRAM. This is why it’s common to do “data and feature reduction” prior to training. For example, images used as training data are usually low resolution, since the number of pixels becomes a limiting feature dimension. Even so, the field has developed with great success despite these limitations! 8GB of memory per GPU is considered minimal and could be a limitation for many applications. 12 to 24GB is fairly common and readily available on high-end video cards. For larger data problems, the 48GB available on the NVIDIA RTX A6000 may be necessary, but it is not commonly needed.
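The effect of image resolution on memory can be estimated with simple arithmetic. The sketch below counts only the raw input batch (batch size × height × width × channels × bytes per value). It ignores model weights, activations and optimizer state, which usually take far more VRAM, so treat it as a lower bound. The function name and the batch size of 32 are illustrative choices.

```python
def batch_input_bytes(batch_size: int, height: int, width: int,
                      channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold one batch of images (FP32 by default)."""
    return batch_size * height * width * channels * bytes_per_value


GIB = 1024 ** 3

for side in (224, 1024, 4096):
    size = batch_input_bytes(32, side, side)
    print(f"{side}x{side} RGB, batch of 32: {size / GIB:.2f} GiB")
```

A batch of 32 images at 4096x4096 already needs 6 GiB for the inputs alone in FP32, which is why high-resolution and 3D image work pushes toward the larger-memory cards.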
Storage Configuration (SSD) for Machine Learning
Storage is one of those areas where buying “more than you think you need” is probably a good idea. The minimum requirements here are similar to CPU memory requirements. After all, your data and projects have to be available!
What storage configuration works best for machine learning and AI?
It’s recommended to use fast NVMe storage whenever possible, since data streaming speeds can become a bottleneck when data is too large to fit in system memory. Staging jobs from NVMe can reduce these slowdowns. NVMe drives are commonly available with up to 4TB capacity.
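When a data set does not fit in system memory, it is typically read from disk in pieces, and this is where drive speed matters. Here is a minimal sketch of chunked reading using only the standard library. The function name, the default chunk size and the small demo file are illustrative; real frameworks have their own data loaders.

```python
import os
import tempfile
from typing import Iterator


def stream_chunks(path: str, chunk_bytes: int = 64 * 1024 * 1024) -> Iterator[bytes]:
    """Yield a file's contents in fixed-size chunks instead of loading it whole."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            yield chunk


# Demo on a small temporary file (1000 bytes, read 256 bytes at a time).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(b"\0" * 1000)
    chunk_sizes = [len(chunk) for chunk in stream_chunks(demo_path, chunk_bytes=256)]

print(chunk_sizes)
```

Only one chunk is held in memory at a time, so throughput is bounded by how fast the drive can deliver the next chunk.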
Together with the fast NVMe storage for staging jobs, more traditional SATA-based SSDs offer larger capacities that can be used for data that exceeds the capacity of typical NVMe drives. 8TB is commonly available for SATA SSDs.
Platter drives can be used for archival storage and for very large data sets. Capacities of 18TB and more are available.
Additionally, all of the above drive types can be configured in RAID arrays. This does add complexity to the system configuration and may use up slots on the motherboard that would otherwise support additional GPUs, but it can allow for storage space in the tens to hundreds of terabytes.
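For a rough idea of how much usable space an array provides, the sketch below applies the standard capacity formulas for a few common RAID levels. The specific levels are our addition for illustration (this post does not recommend one over another), the function name is made up, and real arrays lose a little more space to formatting overhead.

```python
def usable_capacity_tb(level: str, num_drives: int, drive_tb: float) -> float:
    """Usable capacity of an array of identical drives, in terabytes."""
    if level == "raid0":      # striping, no redundancy
        if num_drives < 2:
            raise ValueError("RAID 0 needs at least 2 drives")
        return num_drives * drive_tb
    if level == "raid1":      # mirroring: every drive holds the same data
        if num_drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_tb
    if level == "raid5":      # one drive's worth of parity
        if num_drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (num_drives - 1) * drive_tb
    if level == "raid6":      # two drives' worth of parity
        if num_drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (num_drives - 2) * drive_tb
    if level == "raid10":     # striped mirrors: half the raw capacity
        if num_drives < 4 or num_drives % 2:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return num_drives / 2 * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


print(usable_capacity_tb("raid5", 4, 8.0), "TB from four 8TB SATA SSDs in RAID 5")
print(usable_capacity_tb("raid10", 8, 18.0), "TB from eight 18TB platter drives in RAID 10")
```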
In conclusion, the recommended CPU platforms are Intel Xeon W and AMD Threadripper Pro, while NVIDIA is the dominant player in GPU compute acceleration. The choice of video card will depend on the specific requirements of the machine learning application, with professional NVIDIA GPUs like the RTX A5000 and A6000 being recommended for multi-GPU configurations and data with large feature sizes.
We build and ship Custom PCs across India with up to 3 years of Doorstep Warranty & Lifetime Technical Support. We have 3 stores in Hyderabad, Gurgaon & Bangalore. Feel free to visit them or contact us through our toll-free number (1800 309 2944) for a consultation.