Analysis of IPL data using PySpark

Jul 30, 2024

Analyzing IPL (Indian Premier League) data and building a predictive model using PySpark and Python.

1. Introduction

The IPL (Indian Premier League) cricket data analysis project aims to uncover performance patterns and insights at both the player and team levels. Through refining data types, resolving inconsistencies, and performing feature engineering, this project seeks to deepen the understanding of factors influencing match outcomes and player performances.

The dataset was transformed and enriched with PySpark in Databricks by creating new fields, such as partnership runs, to support the analysis.

The insights gained from this analysis are expected to be valuable for developing strategies in team management, player selection, and game planning, contributing to a data-driven approach in the IPL. This could also benefit cricket enthusiasts in building their dream team.

2. Scope and Methodology

The scope of this project encompasses data cleaning and preprocessing, feature engineering, exploratory data analysis, model building and tuning, and the creation of an end-to-end machine-learning model pipeline. These steps ensure a thorough analysis and the development of practical tools for future IPL seasons.

3. Workflow

The project follows a structured workflow, illustrated in Figure 1, with the following main stages:

Data Collection: Gathering the IPL dataset from Kaggle, which includes two files: matches and deliveries.
Data Cleaning and Preprocessing: Excluding abandoned matches, clarifying the use of the DLS method, correcting schema discrepancies, and ensuring team name consistency to prepare the dataset for analysis.
Exploratory Data Analysis (EDA): Conducting an in-depth analysis to uncover trends, patterns, and insights within the data, providing a solid foundation for feature engineering and modeling.
Data Transformation: Creating new features, such as partnership runs and over phases, and transforming the dataset to enhance the predictive power of the subsequent model.
Model Building: Developing and training machine learning models to predict player and team performance, utilizing a robust pipeline to streamline the process.

4. Data Cleaning and Preprocessing

The IPL data, sourced from Kaggle [1], is divided into two files: matches and deliveries.

🔹Exclusion of Abandoned Matches: Some matches have no data for the columns winner, result, or player_of_match because they were abandoned due to weather conditions.

**Table 1.** Matches that were abandoned and have columns with missing data.

Exclude these rows from our data using the filter() function as they do not contribute valuable information to the analysis and predictive modeling.

**Figure 2**. Removing rows with no results using the **filter()** function.

🔹DLS Method Clarification: The “method” column in the matches dataset indicates whether the DLS (Duckworth-Lewis-Stern) method was used to calculate a new target score due to match interruptions like rain, and has the values “NA” and “D/L”.

For better data clarity, rename the column to “dls_used” using the withColumnRenamed() function and standardize its values to 0/1 using the when() and otherwise() functions, where 1 indicates that the DLS method was used to calculate the new target score. This transformation simplifies the data, making it easier to interpret and analyze the impact of the DLS method on match outcomes.

**Figure 3.** The column name is renamed using the `withColumnRenamed()` function. The column values are updated using PySpark's conditional functions, `when()` and `otherwise()`.

🔹Schema Correction: The schema of the “matches” dataset was adjusted to ensure accurate data types by manually defining the schema structure.

**Figure 4**. Manually defining the schema changes for the data types of columns result_margin, target_runs, and target_overs using the **StructField** **Datatype**.

🔹Team Name Consistency: Over the years, certain IPL teams have rebranded and changed their names. For example, “Royal Challengers Bangalore” became “Royal Challengers Bengaluru.”

**Figure 5**: Team name changes over the period of the IPL.

To maintain data consistency and clarity, replace the old team names with their new names using the replace() function, ensuring uniformity across the dataset.

**Figure 6:** Replacing all occurrences of old team names in the columns `team1`, `team2`, and `toss_winner` with the new names using the `replace()` function.

After transforming the dataset, use the cache() function to store the data in memory (if there's enough space). Subsequent transformations or actions then utilize the cached data instead of re-reading from disk and reapplying previous transformations, significantly enhancing compute performance.

5. Exploratory Data Analysis

🏏 Do Specific Teams Prefer to Bat or Field First?

To determine whether a team chooses to bat or field first, we use groupBy() to partition the data by toss_winner.
The pivot() function creates separate columns for bat and field decisions. We then apply agg() to count the occurrences and fillna() to replace any null values with 0.
The withColumn() function adds a new column, total_tosses, which is used for normalization when calculating batting_percentage.
Finally, select() is used to choose specific columns and orderBy() to sort the results by batting percentage.

**Figure 7:** The green bars represent the total number of tosses won by each team, while the line plot shows the percentage of times each team chooses to bat first over fielding.

The data reveals that the Deccan Chargers and Pune Warriors tend to prefer batting first, whereas the majority of other teams favor fielding first.

🏅 Top IPL Players with Most ‘Player of the Match’ Awards

Group the data by the player_of_match column using groupBy(), which allows aggregation of the number of awards each player has received.
Aggregate the grouped data using agg(), counting the number of times each player has won the "Player of the Match" award. The result is given the alias times_won.

**Figure 8:** Bar chart highlighting the top 10 players with the highest number of ‘Player of the Match’ title wins, showcasing their dominance in standout performances across matches.

AB de Villiers has won the ‘Player of the Match’ award the most times, indicating exceptional performance in matches.
CH Gayle, RG Sharma, DA Warner, and V Kohli also have high numbers of awards, showcasing their consistent match-winning performances.

📊 Performance Analysis of Top Batsmen Across IPL Overs

Reshape the deliveries DataFrame using pivot() by turning unique values in the over column into separate columns, which allows us to calculate and analyze average runs per over for each batter.
Retrieve the top players from the top_players_df as a list using the collect() method.
Filter out only the players that are in this list using filter() and isin().

**Figure 9**: This heatmap illustrates the average runs scored per over by each of the top 10 batsmen across all their IPL matches. The X-axis represents the overs in an innings, while the Y-axis lists the top 10 batsmen. The color intensity indicates the average number of runs scored, with darker shades representing higher averages.

AB de Villiers and DA Warner are particularly effective in the final overs, showcasing their ability to accelerate scoring.
YK Pathan is notable for his strong start, making him valuable in setting up the innings.
CH Gayle and V Kohli are consistent performers throughout the innings, maintaining a steady scoring rate.
SP Narine and RA Jadeja exhibit significant variability, making their performance less predictable but potentially impactful.

🔍 Bowlers vs. Batsmen: Uncovering the Toughest Matchups in IPL

Create a window specification using Window.partitionBy('batter') to partition the data by each batsman. Within this partition, sort the bowlers by the number of wickets taken, in descending order using orderBy().
Apply dense_rank() over the defined window to rank bowlers within each batsman partition based on their wicket counts.
Filter the results to retain only the top-ranked bowler for each batsman by selecting rows where denseRank equals 1.

**Figure 10:** Stacked horizontal bar chart showing the top bowlers who took the most wickets against each batsman. The X-axis shows the total wickets, and the Y-axis lists the batsmen. Each bar segment represents a different bowler, with the legend in the top-right corner.

**Figure 11:** Stacked bar chart displaying the top bowlers who have taken the most wickets against each batsman. The X-axis represents bowlers, while the Y-axis shows the total wickets. Each bar segment indicates wickets taken by different batsmen.

Sunil Narine has dismissed Rohit Sharma the most, with 9 wickets.
Bhuvneshwar Kumar has notably challenged multiple batsmen, including Ajinkya Rahane, Parthiv Patel, and MS Dhoni.
MS Dhoni has struggled against Pragyan Ojha and Zaheer Khan, each taking 7 wickets.
Virat Kohli has been dismissed 7 times by Sandeep Sharma.
AB de Villiers has been dismissed 6 times by Yuzvendra Chahal.

🏆 Top IPL Bowlers: Total Wickets and Average Wickets Per Match

**Figure 12**: Filter deliveries for wickets and aggregate total wickets by each bowler. Get the total matches played by each bowler for normalization when calculating average wickets per match.

**Figure 13**: This horizontal bar chart highlights the top IPL bowlers based on their average wickets per match. The primary X-axis shows average wickets and the secondary X-axis displays total wickets with a line plot.

Lasith Malinga and Jasprit Bumrah are standout bowlers with the highest average wickets per match, each taking approximately 2 wickets per game.
Yuzvendra Chahal and Dwayne Bravo lead in total wickets, demonstrating their consistent performance over many matches.

📈 Patterns in Dismissal Types and Match Situations

**Figure 14**: Group data by dismissal type and match type to calculate the count of occurrences and average overs, filtering for dismissal types with more than 10 occurrences.

**Figure 15:** Stacked bar chart displaying the frequency of different dismissal types across match types, with the primary Y-axis showing the frequency. A secondary Y-axis, represented by a white dashed line plot, depicts the average overs per dismissal type.

Most Common Dismissal: “Caught” is the most frequent dismissal across all match types, especially in league matches with 7,518 occurrences and an average over of 10.97.
Average Overs by Dismissal: “Bowled” and “LBW” dismissals typically occur earlier (e.g., 8.71 overs in finals) compared to “caught” (12.22 overs) and “run out” (later in innings).
Match Type Impact: Finals and eliminators show slight differences in average overs for dismissals compared to league matches, with “caught” occurring later in finals and “run out” occurring even later in eliminators.
Bowled and LBW in Key Matches: “Bowled” dismissals happen earlier in qualifiers and finals (e.g., 8.71 overs in finals), which may reflect more aggressive bowling.

References

[1] https://www.kaggle.com/datasets/patrickb1912/ipl-complete-dataset-20082020