Netflix Data Analysis

May 20, 2024  •  4 min read

Business Problem

As one of the largest OTT streaming platforms in the market, Netflix faces the daily challenge of managing an immense volume of content. From deciding which content to produce and upload to determining the optimal timing for releases, the company needs to navigate these decisions while prioritizing profitability.

Solution

To address these challenges, Netflix relies on data analysts, who take a data-driven approach to decision-making. By analyzing historical data, they can identify patterns, trends, and viewer preferences, which helps Netflix determine the types of shows and movies that resonate with its audience. Data analysis also helps the company tailor its content strategy to different countries, driving business growth globally.

The following is my analysis of the Netflix data provided by Kaggle, using Python, Pandas, Seaborn, and Matplotlib for data analysis and visualization. During this analysis, I sought to address the following questions:

Metric

Since the dataset has no view counts or user star ratings, we use the number of titles added to Netflix as the main metric. We also use how often each cast member, director, and rating appears as a measure of popularity.
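As a rough sketch of what this metric looks like in pandas, the snippet below counts titles by type and by the year they were added. The tiny DataFrame is made-up sample data, and the lowercase column names (`type`, `date_added`) and the date format are assumptions about how the Kaggle CSV is laid out.

```python
import pandas as pd

# Made-up sample rows standing in for the Kaggle data (illustration only)
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["September 25, 2021", "September 24, 2021",
                   "January 1, 2020", None],
})

# Metric 1: number of titles per content type
titles_per_type = df["type"].value_counts()

# Metric 2: number of titles added per year.
# str.strip() guards against stray whitespace; unparseable or missing
# dates become NaT and are dropped before counting.
added = pd.to_datetime(
    df["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
year_added = added.dt.year.dropna().astype(int)
titles_per_year = year_added.value_counts().sort_index()

print(titles_per_type)
print(titles_per_year)
```

The same `value_counts()` call applies to the unnested cast, director, and rating columns when they are used as popularity measures.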

Data Description

The dataset consists of a list of all the TV shows and movies available on Netflix.

Field          Description
Show_id        Unique ID for every Movie / TV Show
Type           Identifier - A Movie or TV Show
Title          Title of the Movie / TV Show
Director       Director of the Movie
Cast           Actors involved in the movie/show
Country        Country where the movie/show was produced
Date_added     Date it was added on Netflix
Release_year   Actual Release year of the movie/show
Rating         TV Rating of the movie/show
Duration       Total Duration - in minutes or number of seasons
Listed_in      Genre
Description    The summary description

Data Preprocessing

Data Cleaning

Most of the data columns were nested, so I had to unnest the data first. For example, the cast column contained multiple comma-separated names, and each name had to go on its own row.

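import pandas as pd  # df is the Netflix DataFrame loaded earlier
# Split each comma-separated cast string into a list of names (NaN becomes ["nan"])
cast = df['cast'].apply(lambda x: str(x).split(', ')).tolist()
# Build one column per cast position, then stack into one name per row
cast_df = pd.DataFrame(cast, index=df['show_id'])
cast_df = cast_df.stack().reset_index(name='cast').drop('level_1', axis=1).set_index('show_id')
# Restore real missing values for titles that had no cast listed
cast_df.replace("nan", float('nan'), inplace=True)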
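For comparison, the same unnesting can be written more compactly with `str.split` and `DataFrame.explode`. This is a sketch of an alternative, not the code used in the notebook, and the sample rows are invented.

```python
import pandas as pd

# Invented sample rows (illustration only); s3 has no cast listed
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "cast": ["Actor A, Actor B", "Actor C", None],
})

cast_df = (
    df[["show_id", "cast"]]
    # Turn each comma-separated string into a list of names
    .assign(cast=lambda d: d["cast"].str.split(", "))
    # Give every list element its own row; missing values stay missing
    .explode("cast")
    .set_index("show_id")
)

print(cast_df)
```

Because `str.split` leaves missing values untouched, this version needs no extra step to turn the string "nan" back into a real missing value.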

Missing Values Handling

Many values were missing for director, country, and cast, and they needed to be handled. Since these columns hold names rather than numbers, I replaced the missing values with “Unknown” instead of using common imputation techniques such as substituting the mean or mode.
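A minimal sketch of this replacement is shown below. The sample rows are invented, and the lowercase column names are an assumption about the Kaggle CSV.

```python
import pandas as pd

# Invented sample rows (illustration only)
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "director": ["Director A", None, "Director B"],
    "cast": [None, "Actor C", "Actor D"],
    "country": ["India", None, None],
})

# Count the missing values in each column before cleaning
missing_before = df.isna().sum()

# Label missing text fields explicitly instead of imputing a real name
text_cols = ["director", "cast", "country"]
df[text_cols] = df[text_cols].fillna("Unknown")

print(missing_before)
print(df)
```

Filling with the mode would instead credit every title without a listed director to the most frequent director, which would inflate that person's count in any popularity ranking.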

Exploratory Data Analysis

I have conducted Exploratory Data Analysis (EDA) on each column in the dataset. This involved generating histograms, countplots, box plots, and line plots. Additionally, I used pair plots and correlation plots to examine relationships between columns. This EDA process provided me with a deeper understanding of each individual column and the relationships between pairs of columns.
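To give a flavor of these plots, here is a small sketch of a countplot and a box plot with Seaborn. The data is invented, the column names are assumed, and the figures in the notebook may be built differently.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented sample rows (illustration only)
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "release_year": [2019, 2020, 2020, 2021, 2021],
})

fig, (ax_count, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Countplot: how many titles there are of each type
sns.countplot(data=df, x="type", ax=ax_count)
ax_count.set_title("Titles by type")

# Box plot: spread of release years within each type
sns.boxplot(data=df, x="type", y="release_year", ax=ax_box)
ax_box.set_title("Release year by type")

fig.tight_layout()
fig.savefig("eda_sketch.png")  # hypothetical output file name
```

In a Jupyter notebook the `matplotlib.use("Agg")` line and the `savefig` call can be dropped, since the figure renders inline.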

My Recommendations For Netflix

Jupyter Notebook

Here are the links to the Jupyter Notebook containing detailed code, insights, and recommendations:

The following is an embedded version of the notebook above, hosted on Kaggle.