An analysis of human factors in Hollywood

Project Description

Hollywood is a complex human network. Producers, directors, writers, actors and cinematographers are working closely with each other to create movies. Their individual talents and the level of cooperation between them potentially have huge impacts on the outcome of movies.
My analysis identifies the factors that have statistical significance and quantifies their impacts on the revenue and profitability of movies.

Web Scraping

I scraped the data from Box Office Mojo. As a first time scraper, I definitely spent more than I’d expected. But I learned to add a random sleep time to reduce the probability of being denied of access.

Feature Engineering

This is a crucial step in the modeling process. The quality of features is a decisive factor of the quality of your model. The following features are created in this analysis:
- Number of stars in the cast: cumulative box office gross for all actors in the database are calculated for a rolling 10-year period. If an actor is ranked in the top 100, then this person is deemed to be a star. For a specific movie, we look at the 10-year period ending in the year that’s 2 years prior to its release year (assuming a 2-year production cycle on average) and count the number of stars in the cast (only the top 4 listed actors are considered).
- Director score: this is a measure based on the cumulative box office for the director in the same 10-year period.
- Prior cooperations between director&actor, director&producer, and producer&writer: the assumption is that people who cooperate more than once tend to have great working chemistry which will translate into success in their future collective efforts.
- Prior experience in the same genre: for each genre, we count the number of times that director/actor had exposure to prior to the movie production.

Model Results

The final model is derived from the following steps:
1. Run OLS using the features described above.
2. Perform grid search on Lasso as a guidance on the choice of features.
3. Run a second OLS on the reduced set of features.
4. Use Random forest to rank features by their importance.

The features that are significant at 95% level are listed below in the order of their importance:
- Director score
- Producer & Writer cooperation
- Star presence
- Director & Actor cooperation
- Actors’ prior experiences in Action/Adventure
- Directors’ prior experiences in Sci-Fi/Fantasy

Future Research

The cooperation features turned out to be among the most important factors in the model. One interpretation is that this might be simply due to the fact that many of the second cooperations are sequels and usually they have higher budgets than the original. Therefore, it’s not surprising that these features have high significance. To isolate the sequel effect, we would need to run the analysis on non-sequal movies only.

Although movie gross revenues are relative easy to capture with a straightforward linear model, the ROI is a bit more difficult. The score on the linear model would’ve been low enough that it’s not at all explanatory. A regression model is clearly not the best choice for ROI, classfication or deep learning might be a better option.

Written on October 8, 2016