Drift diffusion (or evidence accumulation) models have found widespread use in the modelling of simple decision tasks. Extensions of these models, in which the instantaneous drift rate is not fixed but varies over time as a function of a stream of perceptual inputs, have allowed them to account for more complex sensorimotor decision tasks. However, many real-world tasks seemingly rely on even more complex underlying processes. One interesting example is the task of deciding whether to cross a road with an approaching vehicle. This action decision seemingly depends on sensory information about both one's own affordances (whether one can make it across before the vehicle arrives) and the action intentions of others (whether the vehicle is yielding to oneself). Here, we compared three extensions of a standard drift diffusion model with regard to their ability to capture the timing of pedestrian crossing decisions in a virtual reality environment. We find that a single variable-drift diffusion model (S-VDDM), in which the varying drift rate is determined by visual quantities describing vehicle approach and deceleration and is saturated at an upper and a lower bound, can explain multimodal distributions of crossing times well across a broad range of vehicle approach scenarios. More complex models, which attempt to partition the final crossing decision into constituent perceptual decisions, improve the fit to the human data, but further work is needed before firm conclusions can be drawn from this finding.
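
As a rough illustration of the idea of a saturated, time-varying drift rate, and not the fitted S-VDDM itself, the following minimal sketch simulates a single evidence accumulator with an Euler-Maruyama update. Everything named here is an assumption made for illustration: the function `simulate_vddm`, the generic input signal `drive` (standing in for whatever visual quantities describe vehicle approach and deceleration), the linear `gain`, the single upper decision `threshold`, and all default parameter values.

```python
import numpy as np


def simulate_vddm(drive, dt=0.01, gain=1.0, lower=-1.0, upper=1.0,
                  noise_sd=0.5, threshold=1.0, rng=None):
    """Simulate one trial of a variable-drift accumulator (hypothetical sketch).

    drive     : 1-D array, one perceptual input sample per time step (assumed).
    gain      : scales the input into a raw drift rate (assumed linear).
    lower/upper : saturation bounds applied to the drift rate.
    Returns the first time at which evidence reaches `threshold`, or None
    if the threshold is never reached within the input stream.
    """
    rng = np.random.default_rng() if rng is None else rng
    evidence = 0.0
    for step, u in enumerate(drive):
        # The drift follows the input but is saturated at both bounds.
        drift = float(np.clip(gain * u, lower, upper))
        # Euler-Maruyama step: deterministic drift plus Gaussian noise.
        evidence += drift * dt + noise_sd * np.sqrt(dt) * rng.standard_normal()
        if evidence >= threshold:
            return (step + 1) * dt
    return None


if __name__ == "__main__":
    # Hypothetical input: evidence against crossing early, in favour later.
    drive = np.concatenate([np.full(200, -0.5), np.full(800, 2.0)])
    rng = np.random.default_rng(0)
    times = [simulate_vddm(drive, rng=rng) for _ in range(1000)]
    crossed = [t for t in times if t is not None]
    print(f"{len(crossed)} of {len(times)} simulated trials reached threshold")
```

Because the drift is clipped, very strong input cannot drive the accumulator arbitrarily fast, which bounds how quickly a decision can be reached under this sketch. The more complex models mentioned above, which partition the decision into constituent perceptual decisions, are not represented here.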