How can we minimize elevation error between 2 DEMs in an urban area with tall buildings?

First of all, I'd advise reading this post through to the end.

The primary thing I want to emphasize is that the outcome shouldn’t be compared to SRTM.

According to my understanding, the answer is:

“We selected the images with the smallest temporal baseline and largest perpendicular baseline.”

The best choice is not necessarily the absolute largest perpendicular baseline, because a very large baseline may also introduce distortion.
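As a rough illustration of why the perpendicular baseline matters, the sketch below computes the height of ambiguity (the height difference that produces one full phase cycle) using the standard repeat-pass relation h_a = λ·R·sin(θ) / (2·PB). The wavelength, slant range, and incidence angle are assumed, illustrative C-band values, not numbers from this thread. A larger PB gives a smaller height of ambiguity, so the phase is more sensitive to height, which is the trade-off against the distortion mentioned above.

```python
import math

def height_of_ambiguity(wavelength_m, slant_range_m, incidence_deg, perp_baseline_m):
    """Height difference that produces one 2*pi phase cycle (repeat-pass geometry)."""
    theta = math.radians(incidence_deg)
    return wavelength_m * slant_range_m * math.sin(theta) / (2.0 * abs(perp_baseline_m))

# Assumed, illustrative C-band values (not taken from the post).
WAVELENGTH_M = 0.0555
SLANT_RANGE_M = 850e3
INCIDENCE_DEG = 35.0

for pb in (50, 200, 500):
    ha = height_of_ambiguity(WAVELENGTH_M, SLANT_RANGE_M, INCIDENCE_DEG, pb)
    print(f"PB = {pb:3d} m -> height of ambiguity ~ {ha:.1f} m")
```

Doubling PB halves the height of ambiguity, so small height differences become easier to resolve, but unwrapping gets harder over abrupt height steps such as tall buildings.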

I’d suggest the following:

1- Choose a different perpendicular baseline (PB) range: 200 m < PB < 500 m.
2- Two pairs are not enough for a comparison task.
3- Both pairs should share the same pass direction: both ascending and/or both descending.
4- Both pairs should have incidence angles that are as close as possible.
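The criteria above can be sketched as a simple filter. This is a minimal sketch with made-up pair metadata; the `Pair` class, the field names, and the 1-degree incidence tolerance are all hypothetical and would need to be adapted to however you list your candidate pairs.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Pair:
    name: str
    perp_baseline_m: float
    temporal_baseline_days: int
    pass_direction: str   # "ASC" or "DESC"
    incidence_deg: float

def select_pairs(pairs, pb_min=200.0, pb_max=500.0):
    """Keep pairs with pb_min < |PB| < pb_max, shortest temporal baseline first."""
    kept = [p for p in pairs if pb_min < abs(p.perp_baseline_m) < pb_max]
    return sorted(kept, key=lambda p: p.temporal_baseline_days)

def comparable(a, b, max_incidence_diff_deg=1.0):
    """Two pairs are comparable if pass direction matches and incidence angles are close.

    The 1-degree default tolerance is an assumption for illustration only.
    """
    same_direction = a.pass_direction == b.pass_direction
    close_angles = abs(a.incidence_deg - b.incidence_deg) <= max_incidence_diff_deg
    return same_direction and close_angles

# Hypothetical candidate pairs.
candidates = [
    Pair("A", 120.0, 12, "ASC", 39.2),
    Pair("B", 260.0, 12, "ASC", 39.3),
    Pair("C", 310.0, 24, "ASC", 39.1),
    Pair("D", 450.0, 12, "DESC", 34.0),
    Pair("E", 620.0, 12, "ASC", 39.2),
]

selected = select_pairs(candidates)
comparable_sets = [(a.name, b.name) for a, b in combinations(selected, 2) if comparable(a, b)]
print([p.name for p in selected])
print(comparable_sets)
```

With more than two pairs passing the filter, you can compare the resulting elevation models against each other instead of relying on a single pair-to-pair difference.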

Please have a look at the following terms:

“When you void vegetation and man-made features from elevation data, you generate a DEM. A bare-earth elevation model is particularly useful in hydrology, soils, and land use planning”

"But height can come from the top of buildings, tree canopy, powerlines, and other features. A DSM captures the natural and built features on the Earth’s surface."

"(DEM) A digital elevation model is a bare-earth raster grid referenced to a vertical datum. When you filter out non-ground points such as bridges and roads, you get a smooth digital elevation model. The built (power lines, buildings, and towers) and natural (trees and other types of vegetation) aren’t included in a DEM."

The error comes from trying to obtain the DSM, or the correct building heights, by subtracting one DEM from another. Building heights are derived from a DSM, not from a DEM.
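To make the distinction concrete: following the definitions quoted above, the height of above-ground objects is commonly taken as the DSM minus the bare-earth surface (often called a normalized DSM). The sketch below uses tiny made-up grids as nested lists; the function name and the 2 m minimum-height threshold are assumptions for illustration, and real rasters would normally be handled with an array library after co-registering both grids.

```python
def building_heights(dsm, bare_earth, min_height_m=2.0):
    """Per-cell height above ground: DSM minus bare-earth elevation.

    Cells lower than min_height_m above ground are set to 0.0, treating them
    as ground or noise. The threshold is an illustrative assumption.
    """
    if len(dsm) != len(bare_earth) or any(len(r) != len(g) for r, g in zip(dsm, bare_earth)):
        raise ValueError("DSM and bare-earth grids must have the same shape")
    heights = []
    for surface_row, ground_row in zip(dsm, bare_earth):
        row = []
        for surface, ground in zip(surface_row, ground_row):
            diff = surface - ground
            row.append(diff if diff >= min_height_m else 0.0)
        heights.append(row)
    return heights

# Made-up elevations in metres.
dsm = [[105.0, 130.0],
       [104.5, 118.0]]
bare_earth = [[104.0, 105.0],
              [104.0, 103.0]]

print(building_heights(dsm, bare_earth))
```

Subtracting two surfaces of the same kind (for example two InSAR-derived elevation models) only shows how they disagree with each other; it does not isolate building heights.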

"In some countries, a DTM is actually synonymous with a DEM. This means that a DTM is simply an elevation surface representing the bare earth referenced to a common vertical datum.
In the United States and other countries, a DTM has a slightly different meaning. A DTM is a vector data set composed of regularly spaced points and natural features such as ridges and breaklines. A DTM augments a DEM by including linear features of the bare-earth terrain."

Note: Car movement doesn’t have the effect you mentioned.

As for the unwrapping step, SNAPHU is still not very sophisticated, which could also be a cause of the errors.

For more reference regarding the terminology:

Hope this helps.