Analyzing IPL (Indian Premier League) data and building a predictive model using PySpark and Python.
1. Introduction
The IPL (Indian Premier League) cricket data analysis project aims to uncover performance patterns and insights at both the player and team levels. Through refining data types, resolving inconsistencies, and performing feature engineering, this project seeks to deepen the understanding of factors influencing match outcomes and player performances.
Using PySpark in Databricks, we transformed and enriched the dataset by creating new fields, such as partnership runs, to support the analysis.
The insights gained from this analysis are expected to be valuable for developing strategies in team management, player selection, and game planning, contributing to a data-driven approach in the IPL. This could also benefit cricket enthusiasts in building their dream team.
2. Scope and Methodology
The scope of this project encompasses data cleaning and preprocessing, feature engineering, exploratory data analysis, model building and tuning, and the creation of an end-to-end machine-learning model pipeline. These steps ensure a thorough analysis and the development of practical tools for future IPL seasons.
3. Workflow
The project follows a structured workflow, illustrated in Figure 1, which includes the following main stages:
- Data Collection: Gathering the IPL dataset from Kaggle, which includes two files: matches and deliveries.
- Data Cleaning and Preprocessing: Excluding abandoned matches, clarifying the use of the DLS method, correcting schema discrepancies, and ensuring team name consistency to prepare the dataset for analysis.
- Exploratory Data Analysis (EDA): Conducting an in-depth analysis to uncover trends, patterns, and insights within the data, providing a solid foundation for feature engineering and modeling.
- Data Transformation: Creating new features, such as partnership runs and over phases, and transforming the dataset to enhance the predictive power of the subsequent model.
- Model Building: Developing and training machine learning models to predict player and team performance, utilizing a robust pipeline to streamline the process.
4. Data Cleaning and Preprocessing
The IPL data, sourced from Kaggle [1], is split into two files: matches and deliveries.
🔹Exclusion of Abandoned Matches: Some matches have no data for the columns winner, result, or player_of_match because they were abandoned due to weather conditions.
Exclude these rows from our data using the filter() function, as they do not contribute valuable information to the analysis and predictive modeling.
🔹DLS Method Clarification: The “method” column in the matches dataset indicates whether the DLS (Duckworth-Lewis-Stern) method was used to calculate a new target score due to match interruptions like rain, and has the values “NA” and “D/L”.
For better data clarity, rename the column to “dls_used” using the withColumnRenamed() function and standardize its values to 0/1 using the when() and otherwise() functions, where 1 indicates that the DLS method was used to calculate the new target score. This transformation simplifies the data, making it easier to interpret and analyze the impact of the DLS method on match outcomes.
🔹Schema Correction: The schema of the “match” dataset was adjusted to ensure accurate data types by manually defining the schema structure.
🔹Team Name Consistency: Over the years, certain IPL teams have rebranded and changed their names. For example, “Royal Challengers Bangalore” became “Royal Challengers Bengaluru.”
To maintain data consistency and clarity, replace the old team names with their new names using the replace() function, ensuring uniformity across the dataset.
After transforming the dataset, use the cache() function to store the data in memory (if there's enough space). Subsequent transformations or actions then utilize the cached data instead of re-reading from disk and reapplying previous transformations, significantly enhancing compute performance.
5. Exploratory Data Analysis
🏏 Do Specific Teams Prefer to Bat or Field First?
- To determine whether a team chooses to bat or field first, we use groupBy() to partition the data by toss_winner.
- The pivot() function creates separate columns for bat and field decisions. We then apply agg() to count the occurrences and fillna() to replace any null values with 0.
- The withColumn() function adds a total_tosses column, which is used to normalize the counts when calculating batting_percentage.
- Finally, select() is used to choose specific columns and orderBy() to sort the results by batting percentage.
The data reveals that the Deccan Chargers and Pune Warriors tend to prefer batting first, whereas the majority of other teams favor fielding first.
🏅 Top IPL Players with Most ‘Player of the Match’ Awards
- Group the data by the player_of_match column using groupBy(), which allows aggregation of the number of awards each player has received.
- Aggregate the grouped data using agg(), counting the number of times each player has won the "Player of the Match" award. The result is given the alias times_won.
- AB de Villiers has won the ‘Player of the Match’ award the most times, indicating exceptional performance in matches.
- CH Gayle, RG Sharma, DA Warner, and V Kohli also have high numbers of awards, showcasing their consistent match-winning performances.
📊 Performance Analysis of Top Batsmen Across IPL Overs
- Reshape the deliveries DataFrame using pivot() by turning unique values in the over column into separate columns, which allows us to calculate and analyze average runs per over for each batter.
- Retrieve the top players from top_players_df as a list using the collect() method.
- Keep only the players that are in this list using filter() and isin().
- AB de Villiers and DA Warner are particularly effective in the final overs, showcasing their ability to accelerate scoring.
- YK Pathan is notable for his strong start, making him valuable in setting up the innings.
- CH Gayle and V Kohli are consistent performers throughout the innings, maintaining a steady scoring rate.
- SP Narine and RA Jadeja exhibit significant variability, making their performance less predictable but potentially impactful.
🔍 Bowlers vs. Batsmen: Uncovering the Toughest Matchups in IPL
- Create a window specification using Window.partitionBy('batter') to partition the data by each batsman. Within each partition, sort the bowlers by the number of wickets taken, in descending order, using orderBy().
- Apply dense_rank() over the defined window to rank bowlers within each batsman partition based on their wicket counts.
- Filter the results to retain only the top-ranked bowler for each batsman by selecting rows where denseRank equals 1.
- Sunil Narine has dismissed Rohit Sharma the most, with 9 wickets.
- Bhuvneshwar Kumar has notably challenged multiple batsmen, including Ajinkya Rahane, Parthiv Patel, and MS Dhoni.
- MS Dhoni has struggled against Pragyan Ojha and Zaheer Khan, each taking 7 wickets.
- Virat Kohli has been dismissed 7 times by Sandeep Sharma.
- AB de Villiers has been out 6 times by Yuzvendra Chahal.
🏆 Top IPL Bowlers: Total Wickets and Average Wickets Per Match
- Lasith Malinga and Jasprit Bumrah are standout bowlers with the highest average wickets per match, each taking approximately 2 wickets per game.
- Yuzvendra Chahal and Dwayne Bravo lead in total wickets, demonstrating their consistent performance over many matches.
📈 Patterns in Dismissal Types and Match Situations
- Most Common Dismissal: “Caught” is the most frequent dismissal across all match types, especially in league matches with 7,518 occurrences and an average over of 10.97.
- Average Overs by Dismissal: “Bowled” and “LBW” dismissals typically occur earlier (e.g., 8.71 overs in finals) compared to “caught” (12.22 overs) and “run out” (later in innings).
- Match Type Impact: Finals and eliminators show slight differences in average overs for dismissals compared to league matches, with “caught” occurring later in finals and “run out” occurring even later in eliminators.
- Bowled and LBW in Key Matches: “Bowled” dismissals happen earlier in qualifiers and finals (e.g., 8.71 overs in finals), suggesting more aggressive bowling.