Developing algorithms that automatically assess mental constructs from recordings of human behavior is becoming increasingly important, especially for mental health applications. In this work, we focus on a highly prevalent neurodevelopmental disorder, autism spectrum disorder (ASD). While researchers have worked on automatically differentiating individuals with ASD from healthy controls using a variety of behavioral modalities, few works have modeled the severity of ASD symptoms as rated in existing clinical practice. We therefore propose to learn a conversation-level multimodal (speech and text) embedding from a severity assessment interview, the Autism Diagnostic Observation Schedule (ADOS), that captures the intricate interaction behaviors between the investigator and the participant. By fusing two attentional GRUs with this multimodal embedding, our approach achieves an average regression score of 0.567 on four socio-communicative items of the ADOS. Our analysis further suggests that the number of words uttered by both the investigator and the participant is a major predictor.
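To make the fusion architecture concrete, the following is a minimal illustrative sketch of the general idea: two GRU encoders (one per modality) with attention pooling, whose pooled contexts are concatenated into a conversation-level embedding and regressed onto four item scores. All dimensions, parameter shapes, and the simple additive-attention scoring here are hypothetical choices for illustration and are not the paper's actual model or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU cell update (biases omitted for brevity)."""
    z = sigmoid(x @ Wz + h @ Uz)          # update gate
    r = sigmoid(x @ Wr + h @ Ur)          # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde

def run_gru(seq, d_hidden, params):
    """Run a GRU over a (T, d_in) sequence; return all hidden states."""
    h = np.zeros(d_hidden)
    states = []
    for x in seq:
        h = gru_step(x, h, *params)
        states.append(h)
    return np.stack(states)               # (T, d_hidden)

def attention_pool(states, w):
    """Score each time step, softmax, and take the weighted sum."""
    scores = states @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ states                 # (d_hidden,)

d_in, d_h = 8, 6                          # hypothetical feature/hidden sizes

def make_params():
    # Wz, Uz, Wr, Ur, Wh, Uh for one GRU encoder
    return [rng.standard_normal((a, b)) * 0.1
            for a, b in [(d_in, d_h), (d_h, d_h)] * 3]

# Hypothetical per-turn features for the two modalities of one interview
speech_seq = rng.standard_normal((12, d_in))
text_seq = rng.standard_normal((12, d_in))

ctx_speech = attention_pool(run_gru(speech_seq, d_h, make_params()),
                            rng.standard_normal(d_h))
ctx_text = attention_pool(run_gru(text_seq, d_h, make_params()),
                          rng.standard_normal(d_h))

# Conversation-level multimodal embedding: concatenation of both contexts
fused = np.concatenate([ctx_speech, ctx_text])

# Linear regression head over four ADOS item scores
W_out = rng.standard_normal((2 * d_h, 4)) * 0.1
item_scores = fused @ W_out
print(item_scores.shape)  # (4,)
```

In practice the encoder weights, attention vectors, and regression head would be trained jointly on rated interviews; the sketch only shows how the two attentional GRU outputs combine into a single embedding before regression.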