Pivotal Movies
The data for our analysis is gathered from several sources and combined into one large dataframe that collects all project-relevant information.
The CMU Movie Summary Corpus serves as the base dataframe, which we enriched with information from the other datasets. It includes the Wikipedia summaries of 42’306 movies, along with metadata (box office revenue, genre, release date, runtime, language) extracted from Freebase. The dataset contains further information that is not relevant to this project.
The MovieLens dataset contains 45’466 movies, including their budget, box office revenue and reviews. With this dataset we can fill some gaps in the revenues, add each movie’s budget and create a review score.
The MovieStats dataset contains 7’668 movies, including their budget, box office revenue and reviews, scraped from IMDb and covering 1980 to 2010. We include this set to fill in missing values and compute more precise metrics.
The Awards dataset includes the annual nominees and winners of the well-known “Oscars”, the awards for artistic and technical merit in the film industry given each year by the Academy of Motion Picture Arts and Sciences (AMPAS). This data adds another measure of movie quality to the dataset.
The revenues in all datasets are given in US dollars. Due to inflation, one dollar in 1914 (the earliest release date of a movie in the dataset) is not worth the same as one dollar in 2012 (the most recent movie in the dataset). The purchasing power of money has changed, and this effect has to be accounted for when comparing two movies from different points in time. We account for inflation with this additional dataset, which relates the worth of the US dollar in each year to its worth in 1800.
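The adjustment described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `adjust_for_inflation`, the dictionary layout and the index values are assumptions made for the example. The index is taken to be a price level relative to 1800 (1800 = 1.0), and the numbers are placeholders rather than real inflation data.

```python
def adjust_for_inflation(amount, year, price_index, target_year):
    """Convert a nominal dollar amount from `year` into `target_year` dollars.

    `price_index` maps a year to the price level relative to the base year
    (assumed here to be 1800 = 1.0).
    """
    return amount * price_index[target_year] / price_index[year]


# Placeholder values for illustration only, NOT real inflation data.
price_index = {1800: 1.0, 1914: 2.0, 2012: 40.0}

# A hypothetical 1914 revenue expressed in 2012 dollars.
revenue_2012_dollars = adjust_for_inflation(1_000_000, 1914, price_index, 2012)
print(revenue_2012_dollars)
```

Expressing every revenue and budget in the dollars of one common reference year makes movies from different decades directly comparable.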
These two datasets (1 and 2) contain data for 2.5 million movies and series listed on the official IMDb website. They help us fill gaps in the other dataframes and make our data more complete.
We explored, cleaned and augmented the data by merging the datasets. The exact procedure is documented in detail on the project GitHub page.
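As a rough illustration of how a second dataset can fill gaps in the base dataframe, here is a small pandas sketch. The column names, the toy rows and the choice of `title` and `year` as join keys are assumptions for the example. The actual merge logic is the one documented in the project repository.

```python
import pandas as pd

# Toy stand-in for the base dataframe, with missing revenues.
base = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "year": [1995, 2001, 2008],
    "revenue": [1.0e6, None, None],
})

# Toy stand-in for a supplementary dataset that also carries budgets.
extra = pd.DataFrame({
    "title": ["Movie B", "Movie C", "Movie D"],
    "year": [2001, 2008, 2010],
    "revenue": [5.0e6, None, 7.0e6],
    "budget": [2.0e6, 3.0e6, 4.0e6],
})

# Left join keeps every base movie; overlapping columns from `extra`
# receive the "_extra" suffix.
merged = base.merge(extra, on=["title", "year"], how="left",
                    suffixes=("", "_extra"))

# Keep the base revenue where present, otherwise fall back to the extra one.
merged["revenue"] = merged["revenue"].fillna(merged["revenue_extra"])
merged = merged.drop(columns="revenue_extra")
print(merged)
```

A left join keeps the base corpus as the reference set of movies, so the supplementary data only adds or completes columns and never adds rows.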
We ended up with a clean dataset containing English-language movies produced in the USA between 1910 and 2010. The final dataframe includes the following characteristics for each movie: