MIR: parametric encodings of large audio databases for search, beat-matched hybridization, query by humming, etc. Move from per-song analysis to note- and phrase-level analysis.
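A minimal sketch of what note-level matching for query by humming could look like, assuming pitch contours have already been extracted from the query and from the indexed phrases (all data below is hypothetical); dynamic time warping absorbs the timing errors of a hummed query:

    import numpy as np

    def dtw_distance(query, phrase):
        """Dynamic time warping distance between two pitch contours."""
        n, m = len(query), len(phrase)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(query[i - 1] - phrase[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # stretch: repeat a query note
                                     cost[i, j - 1],      # shrink: skip a phrase note
                                     cost[i - 1, j - 1])  # match
        return cost[n, m]

    # Hypothetical contours in MIDI note numbers; subtracting the mean makes
    # the match transposition-invariant, since hummed queries rarely start in key.
    query = np.array([60, 62, 64, 62, 60], dtype=float)
    phrases = [np.array([65, 67, 69, 70, 67, 65], dtype=float),
               np.array([67, 69, 71, 69], dtype=float)]
    q = query - query.mean()
    best = min(range(len(phrases)),
               key=lambda k: dtw_distance(q, phrases[k] - phrases[k].mean()))
    print("best match: phrase", best)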
Audio and video render farms to individually watermark large media databases.
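A sketch of the per-copy embedding step such a farm would run in parallel, here a basic spread-spectrum watermark keyed to the recipient (key scheme and strength are illustrative, not any particular product's method):

    import numpy as np

    def embed_watermark(audio, user_id, strength=1e-3):
        """Add a low-level pseudo-random sequence keyed to this copy's recipient."""
        rng = np.random.default_rng(user_id)      # per-user key
        mark = rng.standard_normal(len(audio))
        return audio + strength * mark

    def detect_watermark(audio, user_id):
        """Correlate against the keyed sequence; a large value means present."""
        rng = np.random.default_rng(user_id)
        mark = rng.standard_normal(len(audio))
        return float(np.dot(audio, mark) / len(audio))

    audio = np.random.default_rng(0).standard_normal(48000)  # 1 s stand-in audio
    marked = embed_watermark(audio, user_id=42)
    print(detect_watermark(marked, 42), detect_watermark(marked, 7))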

Sound Reproduction
Increased channel counts in typical musical performances, from 4-8 speakers to hundreds.

Spatial audio measurement using large microphone arrays.
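A delay-and-sum beamforming sketch, the basic operation behind measuring a sound field with such an array (array geometry, sample rate, and signals below are made up):

    import numpy as np

    def delay_and_sum(signals, mic_xyz, direction, sr=48000, c=343.0):
        """Steer an array toward `direction` (unit vector) by aligning delays.

        signals: (n_mics, n_samples) recorded channels
        mic_xyz: (n_mics, 3) microphone positions in meters
        """
        # Mics nearer the source hear the wavefront earlier; remove that lead.
        delays = mic_xyz @ direction / c
        n = signals.shape[1]
        freqs = np.fft.rfftfreq(n, d=1.0 / sr)
        spectra = np.fft.rfft(signals, axis=1)
        shifted = spectra * np.exp(-2j * np.pi * freqs * delays[:, None])
        return np.fft.irfft(shifted.mean(axis=0), n)

    # Hypothetical 8-microphone line array steered along +x.
    mics = np.stack([np.linspace(0.0, 0.35, 8), np.zeros(8), np.zeros(8)], axis=1)
    rng = np.random.default_rng(1)
    beam = delay_and_sum(rng.standard_normal((8, 4800)), mics, np.array([1.0, 0.0, 0.0]))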

Audio restoration and media databases
Large correlation engines to combine information from independent media into a higher-quality recording. Metadata extraction by analyzing performance style.
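A sketch of the core correlation step: estimating the offset between two independent recordings of the same event so they can be combined (FFT cross-correlation; the signals below are stand-ins):

    import numpy as np

    def estimate_offset(a, b):
        """Lag (in samples) that best aligns recording b against recording a."""
        n = len(a) + len(b) - 1
        nfft = 1 << (n - 1).bit_length()             # next power of two
        corr = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
        # Reorder circular lags to -len(b)+1 .. len(a)-1.
        corr = np.concatenate([corr[-(len(b) - 1):], corr[:len(a)]])
        return int(np.argmax(corr)) - (len(b) - 1)

    rng = np.random.default_rng(2)
    master = rng.standard_normal(10000)
    bootleg = master[3000:8000] + 0.1 * rng.standard_normal(5000)  # delayed, noisy copy
    print(estimate_offset(master, bootleg))  # ~3000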

Spatial audio:
Thousands of real-time convolutions for wall-of-speakers wavefield synthesis.
Large convolution/delay/multiply arrays for VRAS-supported acoustics.
Synthesis of directional audio with spherical arrays and/or wavefield/VRAS hybrids (hundreds of directional waves per source, 100 sources, e.g. an orchestra).
100-source HRTF-style headphone spatial synthesis (a cost sketch follows below).
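A sketch of the 100-source case: every source convolved with a left/right HRTF pair, which is where the per-source convolution cost comes from (the filters below are random placeholders, not measured HRTFs):

    import numpy as np

    def render_binaural(sources, hrtfs):
        """Mix many sources to 2 ears by convolving each with its HRTF pair.

        sources: (n_src, n_samples); hrtfs: (n_src, 2, filt_len).
        Cost scales as n_src * 2 long convolutions per block, two orders of
        magnitude beyond a stereo pan.
        """
        n_src, n = sources.shape
        filt_len = hrtfs.shape[2]
        out = np.zeros((2, n + filt_len - 1))
        for s in range(n_src):
            for ear in range(2):
                out[ear] += np.convolve(sources[s], hrtfs[s, ear])
        return out

    rng = np.random.default_rng(3)
    srcs = rng.standard_normal((100, 4800))            # 100 sources, 0.1 s @ 48 kHz
    filts = rng.standard_normal((100, 2, 256)) * 0.05  # placeholder HRTF pairs
    ears = render_binaural(srcs, filts)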

Deferred, context-sensitive mixing of recordings (100 channels instead of 6). Convolutions with dynamic coefficients are required for each source; see the crossfade sketch below.
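One common way to handle dynamic filter coefficients is sketched here: filter each block with both the old and the new coefficient set and crossfade between the two outputs, so coefficient updates do not click (a standard technique, not a specific system's method):

    import numpy as np

    def timevarying_convolve(x, coeff_schedule, block=256):
        """Block FIR convolution where the coefficients may change every block.

        coeff_schedule: one tap array per block; overlap-add handles tails.
        """
        taps = len(coeff_schedule[0])
        out = np.zeros(len(x) + taps - 1)
        prev = coeff_schedule[0]
        for i, cur in enumerate(coeff_schedule):
            seg = x[i * block:(i + 1) * block]
            if len(seg) == 0:
                break
            a = np.convolve(seg, prev)          # output under old coefficients
            b = np.convolve(seg, cur)           # output under new coefficients
            env = np.ones(len(a))               # fade within the block, hold the tail
            env[:len(seg)] = np.linspace(0.0, 1.0, len(seg))
            out[i * block:i * block + len(a)] += (1.0 - env) * a + env * b
            prev = cur
        return out

    # Hypothetical schedule: the filter drifts block by block.
    rng = np.random.default_rng(8)
    x = rng.standard_normal(4096)
    sched = [np.hanning(64) * np.cos(0.01 * k) for k in range(16)]
    y = timevarying_convolve(x, sched)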

Physical modeling

Non-linear systems (e.g. waveguides, FM) need higher sample rates for better control and fidelity: roughly 10x audio sample rates.
Tractable real-time models are only possible by approximating instruments and the vocal tract with dimensionally reduced models
that pretend there is only one sound source and a rotational symmetry. Tone holes require a minimum of 10x more computation. Each extra dimension
raises the cost from quadratic to cubic or fourth power in the model size. Large instruments like pianos and harps are still inaccurately simulated. (A 1-D waveguide sketch follows below for comparison.)
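The cheap baseline for comparison is the 1-D digital waveguide: two delay lines and a couple of multiplies per sample. A 2-D membrane needs a full N x N mesh per sample and a 3-D air column N^3, which is the quadratic-to-quartic blowup above (parameters here are illustrative):

    import numpy as np

    def waveguide_pluck(freq=440.0, sr=48000, dur=0.5, loss=0.996):
        """1-D digital waveguide string: two counter-propagating delay lines."""
        n = int(sr / freq / 2)                  # half-period of travel per rail
        rng = np.random.default_rng(4)
        right = rng.uniform(-1, 1, n)           # initial pluck shape, right-going
        left = right.copy()                     # and left-going
        out = np.empty(int(sr * dur))
        for i in range(len(out)):
            out[i] = right[-1] + left[-1]       # displacement at the bridge
            r_end, l_end = right[-1], left[0]
            # Reflect at both ends with slight loss (bridge and nut).
            right = np.concatenate(([-loss * l_end], right[:-1]))
            left = np.concatenate((left[1:], [-loss * r_end]))
        return out

    tone = waveguide_pluck()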

Real-time source separation:

Better representations in the initial preprocessing and feature-detection steps require 10x to 100x current computation rates. The final correlation phase
is the big weakness. On-line machine learning here is likely to be large and computationally expensive.

Real-time analysis of polyphonic instruments: harp, 30-40 separate channels; piano, 88; guitars etc., 4-12. Higher robustness for viability requires larger and more expensive machine-learning methods (see the factorization sketch below).
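A sketch of one standard separation approach standing in for the expensive learning stage described above: non-negative matrix factorization of a magnitude spectrogram into per-string/per-note templates and activations (component count and data are placeholders):

    import numpy as np

    def nmf(V, n_components=12, iters=200, eps=1e-9):
        """Factor a magnitude spectrogram V (freq x time) as W @ H.

        Columns of W are note/string spectral templates; rows of H are their
        activations over time, one row per separated voice.
        """
        rng = np.random.default_rng(5)
        F, T = V.shape
        W = rng.uniform(0.1, 1.0, (F, n_components))
        H = rng.uniform(0.1, 1.0, (n_components, T))
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # multiplicative updates
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # (Lee & Seung)
        return W, H

    # Placeholder "spectrogram": in practice |STFT| of a polyphonic recording.
    V = np.abs(np.random.default_rng(6).standard_normal((513, 200)))
    W, H = nmf(V, n_components=12)                 # e.g. 12 strings/notes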

Massive-database parametric concatenative synthesis:

Hybridization of parameters will cost 4-10x current methods. Additive synthesis instead of resampling will require 300-1000 partials per voice.
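A sketch of the oscillator-bank cost: resynthesizing one voice from hundreds of partials, each with its own frequency and amplitude envelope, versus one table read per sample for resampling (the envelopes below are synthetic):

    import numpy as np

    def additive_voice(freqs, amps, sr=48000):
        """Sum sinusoidal partials; freqs/amps are (n_partials, n_frames) envelopes.

        At 300-1000 partials this is 300-1000 sine evaluations per output
        sample per voice.
        """
        phases = np.cumsum(2 * np.pi * freqs / sr, axis=1)  # integrate frequency
        return (amps * np.sin(phases)).sum(axis=0)

    n_frames = 4800                        # 0.1 s at audio-rate control
    k = np.arange(1, 301)[:, None]         # 300 partials
    freqs = 220.0 * k * np.ones((300, n_frames))                   # harmonic series
    amps = (1.0 / k) * np.exp(-3.0 * np.linspace(0.0, 1.0, n_frames))  # decaying
    voice = additive_voice(freqs, amps)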

Acoustic modeling of rooms, speakers and musical instruments:

Large boundary value problems.
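A toy instance of such a boundary-value problem: leapfrog finite differences on a 2-D pressure grid with pressure-release walls. Real rooms need 3-D grids at millimeter-to-centimeter resolution, which is where the scale explodes (grid size and source here are toy values):

    import numpy as np

    def fdtd_2d(steps=200, n=128, courant=0.5):
        """Leapfrog FDTD for the 2-D wave equation, p = 0 on the boundary."""
        p = np.zeros((n, n))        # pressure now
        p_prev = np.zeros((n, n))   # pressure one step back
        p[n // 2, n // 2] = 1.0     # impulsive source mid-room
        c2 = courant ** 2           # 0.5 is below the 2-D stability limit
        trace = []
        for _ in range(steps):
            lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0) +
                   np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4 * p)
            p_next = 2 * p - p_prev + c2 * lap
            p_next[0, :] = p_next[-1, :] = p_next[:, 0] = p_next[:, -1] = 0.0
            p_prev, p = p, p_next
            trace.append(p[n // 4, n // 4])   # "microphone" reading
        return np.array(trace)

    impulse_response = fdtd_2d()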

Video tracking of musicians' gestures (conductors, for example):

Requires a jump from a 10 ms sample rate to around 1 ms, with an associated 10x cost in machine-vision/optical-flow style calculations.
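A sketch of one unit of that work: a least-squares (Lucas-Kanade style) flow estimate for a single image patch, which would now run every millisecond per tracked region instead of every 10 ms (the frames below are synthetic):

    import numpy as np

    def flow_for_patch(f0, f1):
        """Least-squares optical flow (dx, dy) for one patch pair."""
        Ix = np.gradient(f0, axis=1)            # spatial gradients
        Iy = np.gradient(f0, axis=0)
        It = f1 - f0                            # temporal gradient
        A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
        flow, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
        return flow                             # pixels per frame

    # Synthetic smooth patch shifted by one pixel between frames.
    rng = np.random.default_rng(7)
    f0 = np.cumsum(np.cumsum(rng.standard_normal((32, 32)), 0), 1)
    f1 = np.roll(f0, 1, axis=1)
    print(flow_for_patch(f0, f1))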

Score following:

OCR for music (optical music recognition):

Fault Tolerance:
Multiple copies of applications/modules for protection against scheduling faults and software bugs.
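A sketch of the redundancy idea: run several replicas of a module per block and median-filter the surviving outputs, so one crashed or buggy copy cannot silence the audio (the module and the vote rule are illustrative):

    import numpy as np

    def redundant_process(block, modules):
        """Run every replica; the median across survivors masks a bad copy."""
        outputs = []
        for m in modules:
            try:
                outputs.append(m(block))
            except Exception:      # a replica crashed: drop it, keep playing
                pass
        return np.median(np.stack(outputs), axis=0)

    def good(block):  return block * 0.5
    def buggy(block): raise RuntimeError("scheduling fault")

    out = redundant_process(np.ones(256), [good, good, buggy])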