CUDA programming Masterclass with C

Learn parallel programming on GPUs with CUDA, from basic concepts to advanced algorithm implementations.
File size: 3.56 GB
Total length: 10h 48m



Kasun Liyanage


Last updated 11/2021




What you’ll learn

All the basic knowledge about CUDA programming
Ability to design and implement optimized parallel algorithms
Basic workflow of parallel algorithm design



Requirements

Basic C or C++ programming knowledge
How to use the Visual Studio IDE
CUDA toolkit
Nvidia GPU


This course is all about CUDA programming. We will start by looking at basic concepts, including the CUDA programming model, execution model, and memory model. Then we will show you how to implement advanced algorithms using CUDA. CUDA programming is all about performance, so throughout this course you will learn multiple optimization techniques and how to use them to implement algorithms. We will also discuss profiling techniques extensively, along with some of the tools in the CUDA toolkit, including nvprof, nvvp, CUDA-MEMCHECK, and CUDA-GDB.

This course contains the following sections:

Introduction to CUDA programming and CUDA programming model
CUDA Execution model
CUDA memory model: Global memory
CUDA memory model: Shared and Constant memory
CUDA streams
Tuning CUDA instruction level primitives
Algorithm implementation with CUDA
CUDA tools

This course also includes many programming exercises and quizzes. Working through all of them will help you digest the concepts we discuss here.

This course is the first in the CUDA master class series we are currently working on, so the knowledge you gain here is essential for following the later courses as well.


Section 1: Introduction to CUDA programming and CUDA programming model

Lecture 1 Very very important

Lecture 2 Introduction to parallel programming

Lecture 3 Parallel computing and Super computing

Lecture 4 How to install CUDA toolkit and first look at CUDA program

Lecture 5 Basic elements of CUDA program

Lecture 6 Organization of threads in a CUDA program – threadIdx

Lecture 7 Organization of threads in a CUDA program – blockIdx, blockDim, gridDim

Lecture 8 Programming exercise 1

Lecture 9 Unique index calculation using threadIdx, blockIdx and blockDim

Lecture 10 Unique index calculation for 2D grid 1

Lecture 11 Unique index calculation for 2D grid 2

Lecture 12 Memory transfer between host and device

Lecture 13 Programming exercise 2

Lecture 14 Sum array example with validity check

Lecture 15 Sum array example with error handling

Lecture 16 Sum array example with timing

Lecture 17 Device properties

Lecture 18 Summary

Section 2: CUDA Execution model

Lecture 19 Understand the device better

Lecture 20 All about warps

Lecture 21 Warp divergence

Lecture 22 Resource partitioning and latency hiding 1

Lecture 23 Resource partitioning and latency hiding 2

Lecture 24 Occupancy

Lecture 25 Profile driven optimization with nvprof

Lecture 26 Parallel reduction as synchronization example

Lecture 27 Parallel reduction as warp divergence example

Lecture 28 Parallel reduction with loop unrolling

Lecture 29 Parallel reduction as warp unrolling

Lecture 30 Reduction with complete unrolling

Lecture 31 Performance comparison of reduction kernels

Lecture 32 CUDA Dynamic parallelism

Lecture 33 Reduction with dynamic parallelism

Lecture 34 Summary

Section 3: CUDA memory model

Lecture 35 CUDA memory model

Lecture 36 Different memory types in CUDA

Lecture 37 Memory management and pinned memory

Lecture 38 Zero copy memory

Lecture 39 Unified memory

Lecture 40 Global memory access patterns

Lecture 41 Global memory writes

Lecture 42 AOS vs SOA

Lecture 43 Matrix transpose

Lecture 44 Matrix transpose with unrolling

Lecture 45 Matrix transpose with diagonal coordinate system

Lecture 46 Summary

Section 4: CUDA Shared memory and constant memory

Lecture 47 Introduction to CUDA shared memory

Lecture 48 Shared memory access modes and memory banks

Lecture 49 Row major and Column major access to shared memory

Lecture 50 Static and Dynamic shared memory

Lecture 51 Shared memory padding

Lecture 52 Parallel reduction with shared memory

Lecture 53 Synchronization in CUDA

Lecture 54 Matrix transpose with shared memory

Lecture 55 CUDA constant memory

Lecture 56 Matrix transpose with Shared memory padding

Lecture 57 CUDA warp shuffle instructions

Lecture 58 Parallel reduction with warp shuffle instructions

Lecture 59 Summary

Section 5: CUDA Streams

Lecture 60 Introduction to CUDA streams and events

Lecture 61 How to use CUDA asynchronous functions

Lecture 62 How to use CUDA streams

Lecture 63 Overlapping memory transfer and kernel execution

Lecture 64 Stream synchronization and blocking behaviour of NULL stream

Lecture 65 Explicit and implicit synchronization

Lecture 66 CUDA events and timing with CUDA events

Lecture 67 Creating inter stream dependencies with events

Section 6: Performance Tuning with CUDA instruction level primitives

Lecture 68 Introduction to different types of instructions in CUDA

Lecture 69 Floating point operations

Lecture 70 Standard and Intrinsic functions

Lecture 71 Atomic functions

Section 7: Parallel Patterns and Applications

Lecture 72 Scan algorithm introduction

Lecture 73 Simple parallel scan

Lecture 74 Work efficient parallel exclusive scan

Lecture 75 Work efficient parallel inclusive scan

Lecture 76 Parallel scan for large data sets

Lecture 77 Parallel Compact algorithm

Section 8: Bonus: Introduction to Image processing with CUDA

Lecture 78 Introduction part 1

Lecture 79 Introduction part 2

Lecture 80 Digital image processing

Lecture 81 Digital image fundamentals : Human perception

Lecture 82 Digital image fundamentals : Image formation

Lecture 83 OpenCV installation

Who this course is for

Anyone who wants to learn CUDA programming from scratch to an intermediate level

Course Information:

Udemy | English | 10h 48m | 3.56 GB
Created by: Kasun Liyanage
