Detecting and modeling a child's engagement during an interaction offers meaningful insights for assessing socio-emotional and cognitive state. Previous work has shown that a child's level of engagement during an interaction with a psychologist can be captured from the child's vocal behavior. In particular, global statistical measures of vocal features computed over an entire interaction were associated with the perceived level of engagement. We extend this framework by introducing a new scheme that captures the temporal patterning of vocal features using sequence models of the interacting child-psychologist dyad. Using the new set of features, we achieve an improved unweighted accuracy of 73.23% (chance 50.00%) in distinguishing the most engaged state from the others, and a three-way accuracy of 51.42% (chance 33.33%) in discriminating three levels of perceived engagement.
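
For readers unfamiliar with the metric, the following is a minimal sketch of how an unweighted accuracy can be computed, assuming it denotes the mean of per-class recalls (a common convention, not confirmed by the text above). Under that assumption a classifier that always predicts one class scores 1/K for K classes, which is consistent with the stated chance levels of 50.00% and 33.33%. The function name and the labels are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally.

    Assumption: "unweighted accuracy" is taken to mean unweighted
    average recall; the source does not define the metric.
    """
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced three-level engagement labels (illustration only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
majority_guess = [0] * len(y_true)  # always predict the most frequent class

# Plain accuracy rewards the majority guess; the unweighted score does not.
plain = sum(t == p for t, p in zip(y_true, majority_guess)) / len(y_true)
ua = unweighted_accuracy(y_true, majority_guess)
```

With these labels the majority guess reaches a plain accuracy of one half but an unweighted accuracy of only one third, which is why the unweighted score is the more informative one when engagement levels are unevenly represented.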