The video is a bit misleading in how it describes the 64KB of RAM, stating that it has to hold the entirety of the music and sound effects for the whole game. Most games upload whole songs, with all the required samples, as they need them, so it's fairly common to overwrite the old data and start afresh.
Say you're on the map screen and Simian Segue is playing. That more or less means everything required to play Simian Segue is sitting in that RAM. When you enter a level, the CPU sends a command for the song to stop, then uploads the data needed for the new song. If you go back to the map screen, it will upload and play Simian Segue again.
It's rather space constrained for what you can fit in it at a given time, but you can usually put new things in as you need them.
A good chunk of the 64KB is typically used for samples, but the program uploaded to the SMP has to fit in there as well, and that's the most important part. Most games also use echo effects, and the echo buffer can eat up a lot of space too, depending on the delay duration.
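To put rough numbers on that, here is a minimal sketch of the budget arithmetic. It assumes the commonly documented figures: samples are stored as BRR blocks of 9 bytes per 16 samples, and each step of the echo delay setting adds 16 ms of delay and 2 KB of buffer. The function names and the 8 KB driver size in the usage note are made up for illustration, not taken from any particular game.

```python
ARAM_BYTES = 64 * 1024  # total audio RAM shared by driver, samples, and echo


def echo_buffer_bytes(edl: int) -> int:
    """Approximate echo buffer size for an echo delay setting of 0-15.

    Assumes 2048 bytes per step (16 ms of delay each). The zero-delay
    case is simplified to 0 bytes here.
    """
    if not 0 <= edl <= 15:
        raise ValueError("echo delay setting must be in 0..15")
    return edl * 2048


def echo_delay_ms(edl: int) -> int:
    """Echo delay in milliseconds for the same setting."""
    if not 0 <= edl <= 15:
        raise ValueError("echo delay setting must be in 0..15")
    return edl * 16


def brr_bytes(num_samples: int) -> int:
    """Bytes needed to store a sample as BRR: 9 bytes per 16-sample block."""
    if num_samples < 0:
        raise ValueError("sample count must be non-negative")
    blocks = -(-num_samples // 16)  # round up to whole blocks
    return blocks * 9


def remaining_for_samples(driver_bytes: int, edl: int) -> int:
    """Bytes left for sample and sequence data after driver and echo buffer."""
    return ARAM_BYTES - driver_bytes - echo_buffer_bytes(edl)
```

For example, with a hypothetical 8 KB driver and a delay setting of 5 (80 ms), `remaining_for_samples(8192, 5)` leaves 47104 bytes, while the maximum setting of 15 would take 30 KB for the echo buffer alone, nearly half of the RAM.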
The video mentions the NES, and that the SNES is a rather big upgrade. Just to give some context, the NES has very limited sample capabilities and is largely tone-based. For the most part it only plays simple tones (square, triangle, noise), so music and sound effects are made by altering the pitch and volume of each channel to get what you want. The SNES, on the other hand, has the sample-based DSP. You upload samples and play samples, but it can still get complicated rather quickly.
The NES scene has a popular tracker called FamiTracker that a lot of people use. Things are a bit different on SNES though. I think it's more common to use a general tracker, then convert and sequence as needed using a much looser toolchain.
The video is a bit vague on what the DSP can do. It's got a bunch of different ways to alter pitch at different rates, and some games do very crazy things with it. Some games make exceptionally good use of stereo. You have eight channels, master volume and echo volume, each of which can have different volumes for the left and right sides, and there is a volume envelope in there too, which pairs with pitch. Some games abuse the interpolated sample decoding to achieve weird effects. Some use pitch modulation for the same. Some unusual games even stream audio, regularly uploading new samples as the song is being played, with the most famous one being Terranigma. Square's RPGs tend to be big sources of such DSP abuse. The DSP also has a noise generator, but I'll be damned if I can find a game that uses it.
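As a rough illustration of the pitch side of things, here is a small sketch of how a channel's pitch value maps to playback rate. It assumes the commonly documented behaviour: a 14-bit pitch value where 0x1000 plays the sample at its native 32 kHz rate. The helper names are made up for this example.

```python
NATIVE_RATE_HZ = 32000  # DSP output rate
UNITY_PITCH = 0x1000    # pitch value that plays a sample at native rate
MAX_PITCH = 0x3FFF      # pitch is a 14-bit value


def playback_rate_hz(pitch: int) -> float:
    """Effective sample playback rate for a given pitch value."""
    if not 0 <= pitch <= MAX_PITCH:
        raise ValueError("pitch must fit in 14 bits")
    return NATIVE_RATE_HZ * pitch / UNITY_PITCH


def pitch_for_semitones(semitones: float, base: int = UNITY_PITCH) -> int:
    """Pitch value shifted by a number of semitones, clamped to 14 bits."""
    value = round(base * 2 ** (semitones / 12))
    return max(0, min(MAX_PITCH, value))
```

So one octave up from native is `pitch_for_semitones(12)`, which gives 0x2000, and anything much past two octaves up just clamps at 0x3FFF. A driver doing vibrato or slides is basically rewriting this value every tick.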
I think the DKC games are rather by-the-book by comparison, at least with how they use the DSP. Nothing wrong with that though.