CUDA programming Masterclass with C
What you’ll learn
All the basic knowledge about CUDA programming
Ability to design and implement optimized parallel algorithms
Basic workflow of parallel algorithm design
Requirements
Basic C or C++ programming knowledge
How to use the Visual Studio IDE
CUDA toolkit
Nvidia GPU
Description
This course is all about CUDA programming. We start by looking at basic concepts, including the CUDA programming model, execution model, and memory model, and then show you how to implement advanced algorithms with CUDA. CUDA programming is all about performance, so throughout the course you will learn multiple optimization techniques and how to apply them when implementing algorithms. We also cover profiling in detail, using tools from the CUDA toolkit such as nvprof, nvvp, CUDA Memcheck, and CUDA-GDB.
The course contains the following sections:
Introduction to CUDA programming and the CUDA programming model
CUDA execution model
CUDA memory model: global memory
CUDA memory model: shared and constant memory
CUDA streams
Tuning CUDA instruction level primitives
Algorithm implementation with CUDA
CUDA tools
The course also includes plenty of programming exercises and quizzes; working through them will help you digest the concepts discussed here. This is the first course of the CUDA masterclass series we are currently working on, so the knowledge you gain here is essential for following those courses as well.
Overview
Section 1: Introduction to CUDA programming and CUDA programming model
Lecture 1 Very very important
Lecture 2 Introduction to parallel programming
Lecture 3 Parallel computing and supercomputing
Lecture 4 How to install CUDA toolkit and first look at CUDA program
Lecture 5 Basic elements of CUDA program
Lecture 6 Organization of threads in a CUDA program – threadIdx
Lecture 7 Organization of threads in a CUDA program – blockIdx, blockDim, gridDim
Lecture 8 Programming exercise 1
Lecture 9 Unique index calculation using threadIdx, blockIdx, and blockDim
Lecture 10 Unique index calculation for 2D grid 1
Lecture 11 Unique index calculation for 2D grid 2
Lecture 12 Memory transfer between host and device
Lecture 13 Programming exercise 2
Lecture 14 Sum array example with validity check
Lecture 15 Sum array example with error handling
Lecture 16 Sum array example with timing
Lecture 17 Device properties
Lecture 18 Summary
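To give a feel for the material in Section 1, here is a minimal sketch (not the course's own code) of a sum-array program: it builds a unique global index from blockIdx, blockDim, and threadIdx, transfers memory between host and device, and wraps runtime calls in basic error handling. The array size, block size, and the CUDA_CHECK macro name are illustrative assumptions.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Checks the return code of CUDA runtime calls (illustrative helper).
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Each thread adds one pair of elements; the unique global index is
// built from blockIdx, blockDim, and threadIdx.
__global__ void sum_array(const float *a, const float *b, float *c, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        c[gid] = a[gid] + b[gid];
}

int main()
{
    const int n = 1 << 20;                    // 1M elements (illustrative)
    const size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    // Host-to-device memory transfer.
    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    // 1D grid of 1D blocks, 128 threads per block (illustrative choice).
    dim3 block(128);
    dim3 grid((n + block.x - 1) / block.x);
    sum_array<<<grid, block>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Device-to-host transfer and a simple validity check.
    CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}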
Section 2: CUDA Execution model
Lecture 19 Understand the device better
Lecture 20 All about warps
Lecture 21 Warp divergence
Lecture 22 Resource partitioning and latency hiding 1
Lecture 23 Resource partitioning and latency hiding 2
Lecture 24 Occupancy
Lecture 25 Profile driven optimization with nvprof
Lecture 26 Parallel reduction as a synchronization example
Lecture 27 Parallel reduction as a warp divergence example
Lecture 28 Parallel reduction with loop unrolling
Lecture 29 Parallel reduction with warp unrolling
Lecture 30 Reduction with complete unrolling
Lecture 31 Performance comparison of reduction kernels
Lecture 32 CUDA Dynamic parallelism
Lecture 33 Reduction with dynamic parallelism
Lecture 34 Summary
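As a rough illustration of the reduction kernels discussed in Section 2, the sketch below shows a generic in-place reduction on global memory. The interleaved-pair indexing keeps active threads packed into the lowest-numbered warps, which limits warp divergence to the last few iterations. The kernel name, data type, and the assumption that the input length is a multiple of the block size are illustrative; this is not the exact kernel developed in the lectures.

// In-place reduction on global memory: each block reduces its own
// blockDim.x-element chunk of g_idata and writes one partial sum.
// Assumes the input length is a multiple of blockDim.x (the host can
// pad with zeros to guarantee this).
__global__ void reduce_interleaved(int *g_idata, int *g_odata)
{
    unsigned int tid = threadIdx.x;
    int *idata = g_idata + blockIdx.x * blockDim.x;   // this block's chunk

    // The stride starts at half the block and halves each step, so the
    // threads doing work stay contiguous and whole warps retire together.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            idata[tid] += idata[tid + stride];
        __syncthreads();
    }

    // Thread 0 publishes this block's partial result; the host (or a
    // follow-up launch) sums the per-block results.
    if (tid == 0)
        g_odata[blockIdx.x] = idata[0];
}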
Section 3: CUDA memory model
Lecture 35 CUDA memory model
Lecture 36 Different memory types in CUDA
Lecture 37 Memory management and pinned memory
Lecture 38 Zero copy memory
Lecture 39 Unified memory
Lecture 40 Global memory access patterns
Lecture 41 Global memory writes
Lecture 42 AOS vs SOA
Lecture 43 Matrix transpose
Lecture 44 Matrix transpose with unrolling
Lecture 45 Matrix transpose with diagonal coordinate system
Lecture 46 Summary
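To hint at the AoS-versus-SoA discussion in Section 3, the fragment below contrasts the two layouts for a made-up point type. With the structure-of-arrays layout, consecutive threads read consecutive floats, so global memory accesses coalesce; the struct and kernel names are assumptions for the example.

// Array of Structures: element i's x and y sit next to each other, so a
// warp touching only .x loads strided, partially-used cache lines.
struct PointAoS { float x; float y; };

// Structure of Arrays: all x values are contiguous and all y values are
// contiguous, so a warp reading x[i] produces fully coalesced accesses.
struct PointSoA { float *x; float *y; };

__global__ void scale_x_aos(PointAoS *p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= s;      // strided access pattern
}

__global__ void scale_x_soa(PointSoA p, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= s;      // coalesced access pattern
}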
Section 4: CUDA Shared memory and constant memory
Lecture 47 Introduction to CUDA shared memory
Lecture 48 Shared memory access modes and memory banks
Lecture 49 Row major and Column major access to shared memory
Lecture 50 Static and Dynamic shared memory
Lecture 51 Shared memory padding
Lecture 52 Parallel reduction with shared memory
Lecture 53 Synchronization in CUDA
Lecture 54 Matrix transpose with shared memory
Lecture 55 CUDA constant memory
Lecture 56 Matrix transpose with Shared memory padding
Lecture 57 CUDA warp shuffle instructions
Lecture 58 Parallel reduction with warp shuffle instructions
Lecture 59 Summary
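In the spirit of the shared memory and padding lectures in Section 4, here is a generic sketch of a shared-memory matrix transpose. The extra padding column shifts each tile row into a different shared memory bank, so the column-wise reads on the way out avoid bank conflicts. The tile size, kernel name, and launch configuration are illustrative assumptions.

#define TILE_DIM 32   // tile width; launch with dim3 block(TILE_DIM, TILE_DIM)

// Transposes an nrows x ncols row-major matrix into an ncols x nrows one.
__global__ void transpose_shared(const float *in, float *out,
                                 int nrows, int ncols)
{
    // The +1 padding column avoids shared memory bank conflicts.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;   // column in the input
    int y = blockIdx.y * TILE_DIM + threadIdx.y;   // row in the input
    if (x < ncols && y < nrows)
        tile[threadIdx.y][threadIdx.x] = in[y * ncols + x];

    __syncthreads();

    int tx = blockIdx.y * TILE_DIM + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;  // row in the output
    if (tx < nrows && ty < ncols)
        out[ty * nrows + tx] = tile[threadIdx.x][threadIdx.y];
}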
Section 5: CUDA Streams
Lecture 60 Introduction to CUDA streams and events
Lecture 61 How to use CUDA asynchronous functions
Lecture 62 How to use CUDA streams
Lecture 63 Overlapping memory transfer and kernel execution
Lecture 64 Stream synchronization and blocking behaviour of the NULL stream
Lecture 65 Explicit and implicit synchronization
Lecture 66 CUDA events and timing with CUDA events
Lecture 67 Creating inter stream dependencies with events
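A minimal sketch of the overlap pattern covered in Section 5: the input is split into chunks, and each chunk's copy-in, kernel launch, and copy-out are queued in its own stream so that transfers overlap compute. The kernel, chunk count, and sizes are illustrative assumptions; pinned host memory is what makes the copies genuinely asynchronous.

#include <cstdio>
#include <cuda_runtime.h>

// Trivial element-wise kernel standing in for real work.
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int NSTREAMS = 4;                  // illustrative choice
    const int n = 1 << 22;                   // assumed divisible by NSTREAMS
    const int chunk = n / NSTREAMS;

    // Pinned (page-locked) host memory is required for cudaMemcpyAsync
    // to overlap with kernel execution.
    float *h_data;
    cudaMallocHost(&h_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

    dim3 block(256);
    dim3 grid((chunk + block.x - 1) / block.x);

    // Work queued in different streams may overlap: chunk i+1's transfer
    // can run while chunk i's kernel executes.
    for (int i = 0; i < NSTREAMS; ++i) {
        int offset = i * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset,
                        chunk * sizeof(float), cudaMemcpyHostToDevice,
                        streams[i]);
        process<<<grid, block, 0, streams[i]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost,
                        streams[i]);
    }

    for (int i = 0; i < NSTREAMS; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }

    printf("h_data[0] = %f (expected 3.0)\n", h_data[0]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}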
Section 6: Performance Tuning with CUDA instruction level primitives
Lecture 68 Introduction to different types of instructions in CUDA
Lecture 69 Floating point operations
Lecture 70 Standard and intrinsic functions
Lecture 71 Atomic functions
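To hint at what Section 6 covers, here is a tiny sketch contrasting a standard math function with its faster, lower-precision intrinsic counterpart, plus an atomic update; the kernel names and the choice of powf are illustrative assumptions.

// Standard function: accurate, compiles to a longer instruction sequence.
__global__ void square_standard(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = powf(in[i], 2.0f);
}

// Intrinsic function: maps to fewer instructions but trades off precision.
__global__ void square_intrinsic(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __powf(in[i], 2.0f);
}

// Atomic function: many threads can safely update the same counter, at
// the cost of serializing the conflicting accesses.
__global__ void count_positive(const float *in, int *counter, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) atomicAdd(counter, 1);
}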
Section 7: Parallel Patterns and Applications
Lecture 72 Scan algorithm introduction
Lecture 73 Simple parallel scan
Lecture 74 Work efficient parallel exclusive scan
Lecture 75 Work efficient parallel inclusive scan
Lecture 76 Parallel scan for large data sets
Lecture 77 Parallel Compact algorithm
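As a generic illustration of the scan material in Section 7, below is a naive (Hillis-Steele style) inclusive scan for a single block. It is not work-efficient and only handles up to blockDim.x elements, which is exactly the limitation the work-efficient and large-data-set lectures address. The kernel name and launch line are assumptions.

// Naive inclusive scan of one block's worth of data (n <= blockDim.x).
// Double-buffered shared memory avoids read/write races within a step.
__global__ void inclusive_scan_block(const int *in, int *out, int n)
{
    extern __shared__ int temp[];            // 2 * blockDim.x ints
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * blockDim.x + tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        pout = 1 - pout;                     // swap the two buffers
        pin = 1 - pout;
        if (tid >= offset)
            temp[pout * blockDim.x + tid] =
                temp[pin * blockDim.x + tid] +
                temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }

    if (tid < n)
        out[tid] = temp[pout * blockDim.x + tid];
}

// Illustrative launch for n <= 512 elements:
// inclusive_scan_block<<<1, 512, 2 * 512 * sizeof(int)>>>(d_in, d_out, n);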
Section 8: Bonus: Introduction to Image processing with CUDA
Lecture 78 Introduction part 1
Lecture 79 Introduction part 2
Lecture 80 Digital image processing
Lecture 81 Digital image fundamentals: Human perception
Lecture 82 Digital image fundamentals: Image formation
Lecture 83 OpenCV installation
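As a small taste of the bonus image-processing section, here is a sketch of a per-pixel colour-to-grayscale kernel. The interleaved 8-bit BGR layout (the default for an OpenCV cv::Mat) and the Rec.601 luminance weights are assumptions for the example.

// Converts an interleaved 3-channel 8-bit BGR image to grayscale,
// one thread per pixel.
__global__ void bgr_to_gray(const unsigned char *bgr, unsigned char *gray,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = (y * width + x) * 3;
    float b = bgr[idx + 0];
    float g = bgr[idx + 1];
    float r = bgr[idx + 2];

    // Standard Rec.601 luminance weights.
    gray[y * width + x] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
}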
Who this course is for:
Anyone who wants to learn CUDA programming from scratch to an intermediate level
Course Information:
Udemy | English | 10h 48m | 3.56 GB
Created by: Kasun Liyanage