Building a Simpsons IMDb Rating Predictor Using Machine Learning

toldham2
May 16, 2023

Highlights: Advanced Data Processing, ML-assisted Data Cleaning, and Regression

Every fan of The Simpsons knows that the quality of episodes can vary. Some episodes are universally loved, while others don't quite hit the mark. But can we quantify these differences? Can we predict an episode's IMDb rating based on its features? I sought to answer these questions in this project by building a Simpsons IMDb Rating Predictor using machine learning.

Problem Statement and Hypothesis

This project aimed to predict IMDb ratings for Simpsons episodes from features such as the characters and locations that appear in each episode. Because storytelling is complex, this is a challenging task. My hypothesis was that narrative storytelling might be too complex to predict from these elements alone.

The Dataset

The data used in this project was the Kaggle Simpsons dataset, which comprises four tables containing information about episodes, script lines, characters, and locations.

Data Preparation and Preprocessing

The first step was to combine these separate tables into a single usable table. Then I one-hot encoded the characters and locations by episode, so that each episode became one row with a 0/1 column for every character and location, a format that machine learning algorithms can work with.
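As a rough illustration of this step, here is a minimal sketch of one-hot encoding characters and locations per episode with pandas. The tiny tables, the column names (episode_id, character, location, imdb_rating), and the helper one_hot_by_episode are all hypothetical stand-ins, not the actual Kaggle schema or the code from the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the script-lines and episodes tables.
# Column names are assumptions, not the real Kaggle schema.
lines = pd.DataFrame({
    "episode_id": [1, 1, 1, 2, 2],
    "character": ["Homer", "Marge", "Homer", "Bart", "Homer"],
    "location": ["Kitchen", "Kitchen", "Backyard", "School", "Backyard"],
})
episodes = pd.DataFrame({"episode_id": [1, 2], "imdb_rating": [8.2, 7.5]})


def one_hot_by_episode(df, column, prefix):
    """Return one row per episode with a 0/1 column per value seen in it."""
    counts = pd.crosstab(df["episode_id"], df[column])  # line counts per episode
    flags = (counts > 0).astype(int)                    # collapse counts to presence
    return flags.add_prefix(prefix)


# Character flags and location flags share the episode_id index, so join them.
features = one_hot_by_episode(lines, "character", "char_").join(
    one_hot_by_episode(lines, "location", "loc_")
)

# Attach the target so each row holds one episode's rating and its features.
table = episodes.merge(features, left_on="episode_id", right_index=True, how="left")
print(table)
```

Prefixing the columns keeps a character and a location with the same name from colliding once both sets of flags sit in one table.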

Feature Engineering and Selection

To reduce overfitting and keep the model interpretable, I used Lasso feature selection. Lasso adds an L1 penalty that shrinks the coefficients of less important features to exactly zero, effectively excluding them from the model. Through this process, I removed over 7,000 less significant characters and locations from the dataset.
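To make the idea concrete, here is a small sketch of Lasso-based feature selection with scikit-learn. It runs on synthetic data where only two features actually drive the target. The data, the alpha value, and the use of SelectFromModel are assumptions chosen for illustration, not the settings used in the project.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 "episodes" with 50 binary
# presence features (1 = character or location appears in the episode).
X = rng.integers(0, 2, size=(200, 50)).astype(float)

# Only features 0 and 1 affect the made-up rating; the rest are noise.
y = 7.0 + 0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Lasso's L1 penalty drives uninformative coefficients to exactly zero;
# SelectFromModel then keeps only the features with nonzero coefficients.
selector = SelectFromModel(Lasso(alpha=0.05))
selector.fit(X, y)

kept = np.flatnonzero(selector.get_support())
X_reduced = selector.transform(X)
print("kept feature indices:", kept.tolist())
print("reduced shape:", X_reduced.shape)
```

A larger alpha removes more features. In practice the value would usually be tuned, for example with cross-validation, rather than fixed by hand as it is here.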

Results of Feature Selection

The feature selection process revealed some interesting insights. Characters like Maude Flanders, Milhouse Van Houten, Lionel Hutz, and Troy McClure, and locations like Simpson's Backyard, were associated with higher episode ratings. On the other hand, characters like Jimbo Jones and Dolph, and locations like Springfield Town Square, were associated with lower ratings.

The Influence of Time

Another interesting finding was that episodes have received consistently fewer ratings as the series has continued.

Building the Predictor

For the prediction model, I used a Grid Search approach to find the best combination of settings. I tested three different techniques: XGBRegressor alone, XGBRegressor with Principal Component Analysis (PCA), and XGBRegressor with PCA and a StandardScaler. The Root Mean Square Error (RMSE) was used to evaluate the performance of these models.

The RMSE values for the three models were as follows:

XGBRegressor alone: 0.432
XGBRegressor with PCA: 0.448
XGBRegressor with PCA and StandardScaler: 0.521
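A comparison like the one above could be set up roughly as follows. This is only a sketch under several assumptions: the data is synthetic, the parameter grid and PCA size are invented, scaling is placed before PCA, and scikit-learn's GradientBoostingRegressor is swapped in as a stand-in so the example runs without extra dependencies. With xgboost installed, make_model could return xgboost.XGBRegressor instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 120 "episodes" with 10 binary features.
X = rng.integers(0, 2, size=(120, 10)).astype(float)
y = 7.0 + 0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(0, 0.1, size=120)


def make_model():
    # Stand-in for XGBRegressor (an assumption made for portability).
    return GradientBoostingRegressor(random_state=0)


# The three candidate setups, each wrapped in a Pipeline so that any
# preprocessing is refit inside every cross-validation fold.
candidates = {
    "model alone": Pipeline([("model", make_model())]),
    "PCA + model": Pipeline([
        ("pca", PCA(n_components=5)),
        ("model", make_model()),
    ]),
    "scaler + PCA + model": Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=5)),
        ("model", make_model()),
    ]),
}

# Hypothetical hyperparameter grid, shared by all three candidates.
param_grid = {"model__n_estimators": [25, 50], "model__max_depth": [2, 3]}

cv_rmse = {}
for name, pipe in candidates.items():
    search = GridSearchCV(
        pipe, param_grid, scoring="neg_root_mean_squared_error", cv=3
    )
    search.fit(X, y)
    cv_rmse[name] = -search.best_score_  # flip sign back to a positive RMSE

best_name = min(cv_rmse, key=cv_rmse.get)
for name, score in cv_rmse.items():
    print(f"{name}: cross-validated RMSE = {score:.3f}")
print("lowest RMSE:", best_name)
```

scikit-learn maximizes scores, which is why the scorer is the negated RMSE and the sign is flipped before comparing candidates.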

Final Results

After comparing the models, I chose the XGBRegressor alone because it had the lowest RMSE, 0.432. This model was then used for the final test, where it reached an RMSE of 0.471.
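For readers less familiar with the metric, RMSE is the square root of the average squared difference between predicted and actual ratings, so it is expressed in the same units as the ratings themselves. A final RMSE of 0.471 means the predictions were typically off by a bit under half a rating point. The short sketch below computes it by hand on made-up numbers. The rmse helper and the example ratings are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings for three episodes: two predictions miss by 0.5, one is exact.
actual = [7.0, 8.0, 6.5]
predicted = [7.5, 7.5, 6.5]

score = rmse(actual, predicted)
print(f"RMSE = {score:.3f}")  # sqrt((0.25 + 0.25 + 0) / 3) ≈ 0.408
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.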

Conclusion

Ultimately, the prediction model performed as expected per the initial hypothesis. While it could provide some predictive power, the complexity of narrative storytelling was challenging to capture fully based on episode features alone. For future work, Natural Language Processing (NLP) could be used to analyze the script lines and possibly improve the model's predictive power.

This project was an insightful and fun exploration into the world of The Simpsons through the lens of data. It showcased the power and limitations of machine learning in predicting complex outcomes.

Jupyter Notebook: