Robust emotion recognition that can handle different cultures and languages has received increasing emphasis, driven by the potential applicability of emotion recognizers across a wide range of application scenarios. Rather than the conventional approach of deriving a single universal emotion recognition module for all languages, we have previously demonstrated a method that integrates useful information from other databases to improve emotion recognition on the current data through the fusion of multiple emotion perspectives. In this paper, we present an improved framework, i.e., a bootstrapped multi-view weighted kernel fusion, to further advance recognition accuracy. We also extend the modeling from the speech-only modality to include video information. Specifically, we utilize two emotional corpora of different languages. Our proposed framework obtains improved recognition when regressing activation and valence attributes using audio and video modalities on both databases. We not only demonstrate that the weighted kernel fusion provides additional modeling power but also present analyses of the complementary, emotionally-relevant acoustic and visual behaviors computed from the multiple emotion perspectives.
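The weighted kernel fusion mentioned above can be illustrated with a minimal sketch: per-view kernels (e.g., one for audio features, one for video features) are combined as a weighted sum, and the fused kernel drives a standard kernel ridge regression of a continuous emotion attribute. All feature matrices, view weights, and hyperparameters below are illustrative assumptions, not values from the paper.

```python
# Sketch of weighted kernel fusion for regressing an emotion attribute
# (e.g., activation). The data, weights, and hyperparameters here are
# hypothetical placeholders for illustration only.
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    # Pairwise squared Euclidean distances -> Gaussian (RBF) kernel.
    d2 = (np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * d2)

def fused_kernel(views_a, views_b, weights, gamma=0.1):
    # Convex combination of per-view kernels: K = sum_v w_v * K_v.
    return sum(w * rbf_kernel(Xa, Xb, gamma)
               for w, Xa, Xb in zip(weights, views_a, views_b))

rng = np.random.default_rng(0)
n_train, n_test = 80, 20
# Two "views" standing in for audio and video feature sets.
audio = rng.normal(size=(n_train + n_test, 10))
video = rng.normal(size=(n_train + n_test, 6))
# Synthetic continuous target in [-1, 1], mimicking activation/valence.
y = np.tanh(audio[:, 0] + 0.5 * video[:, 0])

tr, te = slice(0, n_train), slice(n_train, None)
views_tr = [audio[tr], video[tr]]
views_te = [audio[te], video[te]]
weights = [0.6, 0.4]  # illustrative view weights (sum to 1)

# Kernel ridge regression on the fused kernel.
lam = 1e-2
K_tr = fused_kernel(views_tr, views_tr, weights)
alpha = np.linalg.solve(K_tr + lam * np.eye(n_train), y[tr])
y_pred = fused_kernel(views_te, views_tr, weights) @ alpha
```

In the actual framework the view weights would be learned or tuned (e.g., via cross-validation) rather than fixed by hand, and the views would come from emotionally-relevant acoustic and visual descriptors rather than random features.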