Predicting Star Trek TOS Spoken Lines with scikit-learn

In a previous post, I collected the transcripts of all the original Star Trek seasons and analyzed linguistic features of Kirk and Spock.  In this post, I train a scikit-learn model on the lines spoken by Kirk, Spock, and McCoy, in hopes that the model can guess which of those three characters spoke a given unseen line.

Why do this?  I am studying machine learning and am trying to come up with exercises!  Also, this came about after going through the classic “20 Newsgroups” prediction tutorial.  My Kirk/Spock/McCoy exercise is similar to that one, with a bit of pandas data munging first.


The first part takes files of lines spoken by the three ST-TOS characters, tokenizes them into sentences, and then generates a pandas DataFrame with columns for the text of the line and the speaker.
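A minimal sketch of that step, using a naive regex sentence splitter and a few made-up sample lines in place of the actual transcript files:

```python
import re

import pandas as pd

# Stand-ins for the per-character transcript files (made up for illustration)
raw_lines = {
    "kirk": "Beam us up, Scotty. Fire phasers!",
    "spock": "Fascinating. That is highly illogical, Captain.",
    "mccoy": "He's dead, Jim. I'm a doctor, not an engineer!",
}

rows = []
for speaker, text in raw_lines.items():
    # Naive sentence tokenizer: split after ., ! or ? followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if sentence.strip():
            rows.append({"line": sentence.strip(), "speaker": speaker})

df = pd.DataFrame(rows)
print(df)
```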

Next up are the scikit-learn steps.  The first of those is to split the data into training and testing parts.
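A sketch of the split, assuming a DataFrame of lines and speakers like the one built above (the `test_size` and `random_state` values here are illustrative, not necessarily what I used):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny stand-in for the real DataFrame of lines and speakers
df = pd.DataFrame({
    "line": ["Fire phasers!", "Fascinating.", "He's dead, Jim.", "Beam us up."],
    "speaker": ["kirk", "spock", "mccoy", "kirk"],
})

# Hold out 25% of the lines for testing
X_train, X_test, y_train, y_test = train_test_split(
    df["line"], df["speaker"], test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))
```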

Next we use scikit-learn’s CountVectorizer, which learns the vocabulary and turns each line into a vector of token counts.

By the way, at this point I also tried CountVectorizer’s stop_words hyperparameter, but found that the prediction accuracy was slightly lower than the result below.

I also tried scikit-learn’s TfidfTransformer, but again the prediction accuracy was reduced.
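For reference, the two variants I tried could be wired up like this (again on toy data); both ended up slightly hurting accuracy in my runs:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X_train = ["Fire phasers!", "Fascinating, Captain.", "He's dead, Jim."]

# Variant 1: drop common English stop words while learning the vocabulary
counts_sw = CountVectorizer(stop_words="english").fit_transform(X_train)

# Variant 2: re-weight the raw token counts by tf-idf
counts = CountVectorizer().fit_transform(X_train)
X_train_tfidf = TfidfTransformer().fit_transform(counts)
print(counts_sw.shape, X_train_tfidf.shape)
```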

So the next step is to use MultinomialNB, which learns the association between the text and the speaker.  We then fit the model and generate predictions from our train/test split.
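A self-contained sketch of the fit/predict step; the training and test lines below are made up for illustration, whereas the real ones come from the train/test split above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-ins for the real train/test split
X_train = ["Fire phasers!", "Fascinating, Captain.", "He's dead, Jim.",
           "Set a course for Vulcan.", "That is illogical.",
           "I'm a doctor, not a bricklayer!"]
y_train = ["kirk", "spock", "mccoy", "kirk", "spock", "mccoy"]
X_test = ["Highly illogical, Captain."]
y_test = ["spock"]

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

clf = MultinomialNB()
clf.fit(X_train_counts, y_train)

# transform (not fit_transform) so the test lines use the training vocabulary
predicted = clf.predict(vectorizer.transform(X_test))
print(predicted)
print(accuracy_score(y_test, predicted))
```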

When the prediction accuracy is computed, we get 66.6%.  Hmmm…

Ok, that is so-so…  But let’s try to predict some made-up lines and see how the model does.
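Something along these lines, with hypothetical made-up lines in the spirit of each character (the classifier is refit on toy data here so the snippet stands alone; in the post it was fit on the full training split):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the full training split
X_train = ["Fire all phasers!", "Fascinating, Captain.", "He's dead, Jim.",
           "Warp factor eight.", "That is highly illogical.",
           "I'm a doctor, not an engineer!"]
y_train = ["kirk", "spock", "mccoy", "kirk", "spock", "mccoy"]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)

# Made-up lines to classify
new_lines = ["Fire photon torpedoes!",
             "That is not logical.",
             "Dammit Jim, I'm a doctor!"]
predicted = clf.predict(vectorizer.transform(new_lines))
for line, speaker in zip(new_lines, predicted):
    print(f"{speaker}: {line}")
```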

These lines are pretty obvious to anyone who has watched the series, and the model gets them right.  Yay!  Other, more ambiguous lines (in X_test/y_test, for instance) are not so easy to predict.  But 66.6% is decent, I think. :-)