As one of the largest OTT streaming platforms in the market, Netflix faces the daily challenge of managing an immense volume of content. From deciding which content to produce and upload to determining the optimal timing for releases, the company needs to navigate these decisions while prioritizing profitability.
To address these challenges, Netflix relies on data analysts, who take a data-driven approach to decision-making. By studying historical data, they identify patterns, trends, and viewer preferences, which helps Netflix determine the types of shows and movies that resonate with its audience. Data analysis also helps the company tailor its content strategy to different countries, driving business growth globally.
The following is my analysis of the Netflix dataset from Kaggle, using Python, Pandas, Seaborn, and Matplotlib for data analysis and visualization. In this analysis, I sought to address the following questions:
Since the dataset contains no view counts or user star ratings, we use the number of titles added to Netflix as the main metric. We also use the number of titles per cast member, director, and rating as a proxy for popularity.
The dataset consists of a list of all the TV shows/movies available on Netflix.
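As a rough illustration of this metric, the sketch below counts titles by type and by the year they were added. The sample rows are made up, and the `date_added` format (`"September 25, 2021"`) is an assumption about how the Kaggle file stores dates; the column names follow the field list in this post.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the Kaggle CSV.
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["September 25, 2021", "September 24, 2021",
                   "January 1, 2020", None],
})

# Number of titles per content type.
titles_per_type = df["type"].value_counts()

# Parse the date (assumed format) and count unique titles per year added.
# Rows with a missing or unparseable date become NaT and are left out.
added = pd.to_datetime(df["date_added"], format="%B %d, %Y", errors="coerce")
df["year_added"] = added.dt.year
titles_per_year = df.groupby("year_added")["show_id"].nunique()

print(titles_per_type)
print(titles_per_year)
```

The same `value_counts` / `groupby` pattern can be applied to the unnested cast, director, and rating columns to rank them by how many titles they appear in.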
Field | Description |
---|---|
Show_id | Unique ID for every Movie / TV Show |
Type | Identifier - A Movie or TV Show |
Title | Title of the Movie / TV Show |
Director | Director of the Movie |
Cast | Actors involved in the movie/show |
Country | Country where the movie/show was produced |
Date_added | Date it was added on Netflix |
Release_year | Actual Release year of the movie/show |
Rating | TV Rating of the movie/show |
Duration | Total Duration - in minutes or number of seasons |
Listed_in | Genre |
Description | The summary description |
Because most of the columns held nested values, I had to unnest the data first. For example, the cast column contained multiple comma-separated names, and each name had to be placed on its own row.
cast=df['cast'].apply(lambda x: str(x).split(', ')).tolist()
cast_df=pd.DataFrame(cast,index=df['show_id'])
cast_df=cast_df.stack().reset_index(name='cast').drop('level_1', axis=1).set_index('show_id')
cast_df.replace("nan", float('nan'), inplace=True)
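For comparison, here is a shorter sketch of the same unnesting step using `Series.str.split` and `Series.explode` (available in pandas 0.25 and later). This is an alternative to the code above, not the version used in the notebook, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Netflix dataset.
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "cast": ["Actor A, Actor B", None, "Actor C"],
})

cast_long = (
    df.set_index("show_id")["cast"]
      .str.split(", ")   # list of names per title; missing values stay missing
      .explode()         # one row per name, show_id repeated in the index
      .to_frame()
)

print(cast_long)
```

Titles with no cast information keep a single row with a missing value, so they are not silently dropped from later counts.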
The director, country, and cast columns had many missing values, and the affected rows still needed to be included in the analysis. Because these are text fields, I replaced the missing values with “Unknown” rather than using common imputation techniques such as substituting the mean or mode.
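A minimal sketch of that replacement, assuming the lowercase column names `director`, `country`, and `cast` and using made-up sample rows:

```python
import pandas as pd

# Hypothetical sample rows with missing text values.
df = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "director": ["Director A", None],
    "country": [None, "India"],
    "cast": ["Actor A", None],
})

# Fill missing values in the text columns with a placeholder label,
# so the rows are kept and show up as their own "Unknown" group.
text_cols = ["director", "country", "cast"]
df[text_cols] = df[text_cols].fillna("Unknown")

print(df)
```

Keeping "Unknown" as an explicit category makes the amount of missing data visible in count plots instead of hiding it.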
I conducted Exploratory Data Analysis (EDA) on each column in the dataset, generating histograms, count plots, box plots, and line plots. I also used pair plots and correlation plots to examine relationships between columns. This process gave me a deeper understanding of each individual column and of the relationships between pairs of columns.
Here are the links to the Jupyter Notebook containing detailed code, insights, and recommendations:
The following is an embedded version of the above notebook, hosted on Kaggle.