March Madness is one of the most unpredictable sporting events in the world. Every year, millions of fans attempt to build the perfect bracket, only to watch their predictions unravel with unexpected upsets and Cinderella stories. What if we could leverage data analytics to improve our chances of making accurate predictions? In this article, we’ll use Dataiku, a powerful data science and machine learning platform, to analyze historical NCAA tournament data and uncover patterns that could help us forecast game outcomes. By applying various analytical techniques—such as exploratory data analysis, correlation analysis, and feature engineering—we aim to determine whether data can bring more order to the madness. Along the way, we’ll demonstrate how Dataiku makes it easy to ingest, clean, and analyze data, helping us build the foundation for a predictive model. Ultimately, this analysis will set the stage for our next step: using AI, including ChatGPT, to make game predictions for the 2025 tournament.
Selection Sunday: Finding the Right Data
Any analysis or predictive task depends on the availability and quality of the underlying data. Thankfully, a treasure trove of March Madness data has been collected in the March Madness Data Kaggle project, which is updated annually. We’re going to make use of two files from this project: Resumes (originally found here), which contains team-level statistics for every tournament team since 2008, and Tournament Matchups, which contains the final score of each tournament matchup since 2008.
The Resumes dataset contains a wealth of team metrics that could potentially be useful for predicting tournament success. However, to keep our analysis focused and interpretable, we’ll concentrate on a single key metric: each team’s Elo rating going into the tournament. Elo is a widely used rating system designed to quantify a team’s relative strength based on past performance. We’ve previously applied this method in our NFL Super Bowl analysis, and it has proven to be a valuable tool for assessing team competitiveness. While Elo isn’t the only factor that determines a team’s success, it provides a strong baseline for understanding relative performance.
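For readers new to Elo, here is a minimal sketch of the textbook update rule. The 400-point scale and the K-factor of 20 are conventional defaults we are assuming for illustration; the Kaggle data does not document the exact variant behind its Elo column, and the ratings below are made up.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, a_won: bool, k: float = 20.0):
    """Return both teams' new ratings after one game (a zero-sum exchange)."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    return rating_a + delta, rating_b - delta


# Hypothetical ratings: a 1600-rated favorite loses to a 1500-rated underdog.
favorite, underdog = update_ratings(1600.0, 1500.0, a_won=False)
print(f"favorite: {favorite:.1f}, underdog: {underdog:.1f}")
```

An upset moves more rating points than an expected result, which is why a team’s rating reflects both how often it wins and whom it beats.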
For this analysis, we will download these raw datasets from Kaggle and upload them to a new Dataiku project where we’ll perform the analysis. If you’re new to Dataiku, it’s easy to get set up by either downloading the free local version or signing up for the free 14-day online trial.
Although we skip some of the intermediate Dataiku project steps in this article, you’ll find all the details in the completed project download, which is available here.
First Round: EDA
Exploratory Data Analysis (EDA) is a crucial step in any data analysis process, as it helps uncover hidden patterns, relationships, and anomalies within the dataset. By thoroughly exploring the data, we can identify key trends, assess data quality, and determine which variables may have the most predictive power. EDA allows us to form hypotheses and guide subsequent modeling efforts, ensuring that we extract meaningful insights rather than relying on assumptions. In the context of March Madness, EDA helps us understand how simple factors like team seed and Elo rating correlate with tournament success, laying the groundwork for more advanced predictive modeling.
To begin our analysis, we’ll use Dataiku to explore key insights from the Resumes dataset. This dataset provides a historical record of every NCAA tournament team since 2008, including their seed (1-16), number of tournament wins (0-6), and Elo national ranking entering the tournament. By examining these factors, we can start identifying trends and relationships that may help us predict future outcomes. It’s important to note that the Elo metric in this dataset represents a team’s national ranking rather than its raw rating score. Through this initial EDA, we’ll establish a foundational understanding of how seed and Elo ranking influence tournament success—insights that will be critical as we move toward building a predictive model.
Seeds of Tournament Success
One of the most widely used factors in tournament predictions is a team’s seed. By analyzing the Resumes dataset, we can examine how a team’s seed has historically correlated with tournament performance and what patterns emerge from past results.
In the preceding chart, we see the relationship between a team’s seed (1 being a top seed) and the number of wins (or “rounds”) those seeds achieve on average in NCAA tournaments. We have a total of 960 team seasons to analyze, an adequate amount of data to base our analysis on. From this chart, we can observe some very interesting insights, among them:
- 1 seeds in the bracket tend to make it a little past the 3rd round (meaning they win more than three games).
- 16 seeds almost never win a game. In this seeded knockout format, 16 seeds always open the tournament against 1 seeds. Picking a 16 seed to win even a single game is a real shot in the dark!
- There’s a tremendous dropoff from 1 to 2 seeds - over a full round. Logic might lead you to believe that on a scale of 1-16, there is a very small skill difference between a 1 and 2 seed, but in this case, the historical success does not agree.
- Other significant cliffs also exist from the 3 to 4 and 4 to 5 seeds - nearly a half-round dropoff between each.
- If we followed an exponential decay curve (which is what the rest of the data looks like), we’d expect the 6 seeds to be slightly more successful than the 7, but they’re not. In the first round of the tournament, 6 seeds face off against 11s, and for some reason, they don’t fare well. 11s do nearly as well as 7 seeds.
- If you’re trying to find that Cinderella story, it’s pretty clear that, even rounding up, seeds 13 through 16 don’t usually win a game. Of course, it seems that every year there’s a double-digit seed or two that makes a deep run, but it is a statistical anomaly when they do.
From this analysis, it’s clear that a team’s seed plays a significant role in determining its likelihood of success in the tournament. Higher seeds, particularly 1s and 2s, have historically advanced further, while lower seeds face an uphill battle, with only a few breaking through with unexpected success.
There are also some unexpected patterns—such as the strong performance of 11 seeds—that challenge conventional wisdom. While seeding provides a useful baseline for predicting tournament success, it doesn’t tell the whole story. To gain a deeper understanding, we need to explore additional factors, such as Elo rankings, to see if they offer further predictive value.
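The seed chart discussed in this section boils down to a group-by-and-average. As a rough sketch in plain Python (not the Dataiku chart itself), the snippet below computes average wins per seed. The rows and the `(year, team, seed, wins)` layout are hypothetical stand-ins for the Resumes data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows shaped like the Resumes data: (year, team, seed, wins).
resumes = [
    (2023, "Team A", 1, 4),
    (2023, "Team B", 1, 2),
    (2023, "Team C", 6, 0),
    (2023, "Team D", 11, 2),
    (2023, "Team E", 11, 0),
    (2023, "Team F", 16, 0),
]


def average_wins_by_seed(rows):
    """Group team seasons by seed and average their tournament wins."""
    wins_by_seed = defaultdict(list)
    for _year, _team, seed, wins in rows:
        wins_by_seed[seed].append(wins)
    return {seed: mean(wins) for seed, wins in sorted(wins_by_seed.items())}


avg_wins = average_wins_by_seed(resumes)
for seed, avg in avg_wins.items():
    print(f"seed {seed:>2}: {avg:.2f} average wins")
```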
Is Elo the Dark Horse?
Let’s take a look at a similar chart to see if a team’s Elo ranking offers similar predictive value in determining tournament success. Elo is designed to assess a team's relative strength based on past performance, and in theory, it should provide a useful measure of a team’s potential. In this chart, we observe a familiar trend—just like seeding, teams with a lower (better) Elo ranking tend to win more games in the tournament. This suggests that Elo, like seeding, captures important information about team performance.
Some additional key insights emerge from the data. We observe steep drop-offs in tournament success once teams fall outside the top 60 in Elo rankings, with those teams rarely advancing past the first round. This aligns with what we observed in seeding, where lower-seeded teams (those with higher seed numbers) struggle to make deep tournament runs. But does this mean that Elo and seeding are interchangeable metrics? Or does Elo offer additional predictive value beyond what is already captured in the seeding process?
A natural question arises: Does the NCAA selection committee already factor Elo into their seeding decisions? If seeding and Elo rankings are perfectly correlated, then Elo may not add much unique insight. However, if discrepancies exist between the two, it could suggest that Elo captures something distinct, perhaps a team’s true strength beyond just its tournament seed. To explore this further, let’s look at a specific example where Elo might provide additional predictive power: the curious case of 11 seeds.
A Cinderella Story: The 11-Seed Anomaly
In the previous section, we uncovered a fascinating anomaly in tournament performance for 11 seeds. While most seed-based trends follow a predictable, seemingly exponential decline in success as seed numbers increase, 11 seeds consistently outperform expectations. This unusual pattern suggests that something beyond just seeding might be influencing their success. To dig deeper, we can isolate 11 seeds and examine their performance based on Elo rankings to see if this additional metric provides further insight.
The following chart focuses solely on 11 seeds and the relationship between their Elo ranking and tournament wins. An interesting trend emerges: Elo ranking appears to be a strong predictor of success for 11 seeds. Teams ranked outside the Elo top 20 perform about as expected, in line with historical seed-based trends. However, the true Cinderella 11s, the ones that make deep runs, tend to have an Elo ranking inside the top 20. This suggests that not all 11 seeds are created equal. Rather, certain 11 seeds have a much higher chance of making a surprising tournament run, and their Elo ranking helps us distinguish them.
While this visual analysis alone doesn’t provide a definitive predictive model, it does offer a compelling case for further exploration. The strong relationship between Elo and tournament success among 11 seeds indicates that Elo could provide additional predictive value beyond just seeding. This finding encourages us to quantify these relationships more rigorously in the next step of our analysis.
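One simple way to check the 11-seed pattern numerically is to split those teams at an Elo-rank cutoff and compare average wins on each side. The sketch below does that in plain Python. The teams, ranks, and win counts are invented for illustration, and the cutoff of 20 simply mirrors the threshold discussed above.

```python
from statistics import mean

# Hypothetical 11 seeds: (team, elo_rank, wins). Values are illustrative only.
eleven_seeds = [
    ("Team A", 12, 3),
    ("Team B", 18, 4),
    ("Team C", 35, 0),
    ("Team D", 41, 1),
    ("Team E", 52, 0),
]


def average_wins_by_elo_bucket(rows, cutoff=20):
    """Compare average wins for teams inside vs. outside an Elo-rank cutoff."""
    inside = [wins for _team, elo_rank, wins in rows if elo_rank <= cutoff]
    outside = [wins for _team, elo_rank, wins in rows if elo_rank > cutoff]
    return {
        f"top {cutoff}": mean(inside) if inside else None,
        f"outside top {cutoff}": mean(outside) if outside else None,
    }


buckets = average_wins_by_elo_bucket(eleven_seeds)
for label, avg in buckets.items():
    print(f"{label}: {avg}")
```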
Second Round: Quantifying the Relationships
The visual trends we’ve explored so far suggest a strong relationship between a team’s seed, Elo ranking, and tournament success. Both metrics appear to play a crucial role in determining how far a team is likely to advance. However, visual patterns alone can only tell us so much—we need a more precise way to measure the strength of these relationships.
To take our analysis a step further, we’ll quantify the correlation between seed, Elo ranking, and tournament wins. This will allow us to determine which factor typically has a stronger predictive relationship with success. By applying statistical correlation methods, we can assess whether these metrics are related to wins and how strongly they influence a team's tournament performance. This will help us understand whether seeding alone is a good predictor of success, or if Elo provides unique and valuable insight.
Seeds, Standings, and Spearman Correlation
One effective way to measure this relationship is by using Spearman correlation, which evaluates how well the rankings of two variables align. Spearman correlation assesses monotonic relationships, meaning it identifies whether one variable tends to increase (or decrease) as the other increases, even if the relationship isn’t perfectly linear. This makes it particularly useful for tournament data, where metrics like seed and Elo ranking may not have a strictly linear impact on wins but still follow a clear rank-based trend. By calculating the Spearman correlation, we can determine which factor, seed or Elo ranking, has a stronger and more consistent association with a team’s success.
Performing a Spearman correlation in Dataiku is straightforward, and in the resulting matrix we can see that both metrics have a strong negative correlation with wins (as expected, since lower seed and ranking numbers indicate stronger teams). As the previous two charts hinted, a team’s seed is more strongly correlated with its number of tournament wins than its Elo ranking is. This doesn’t mean that Elo isn’t valuable for making predictions, just that on its own it has historically been less strongly correlated with wins.
While both seed and Elo ranking are strongly correlated with wins, it’s important to note that these two metrics are not perfectly correlated with each other: their Spearman correlation is 0.841. This suggests that while they share a significant relationship, they are likely capturing different aspects of team strength. As a result, combining both metrics in predictive models could add valuable insights, as each brings unique information to the table. This mirrors what we observed in our analysis of the 11 seeds, where using multiple factors together provided a more complete picture of a team’s potential tournament performance.
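To make the statistic concrete, Spearman correlation is simply Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below implements it in plain Python. Dataiku computes this for us, so this is only an illustration of the idea, and the seed, Elo-rank, and win values are made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical team seasons, for illustration only.
seeds = [1, 2, 4, 8, 11, 16]
elo_ranks = [2, 5, 20, 30, 18, 150]
wins = [5, 3, 2, 1, 2, 0]

print(f"seed vs wins:     {spearman(seeds, wins):.3f}")
print(f"Elo rank vs wins: {spearman(elo_ranks, wins):.3f}")
print(f"seed vs Elo rank: {spearman(seeds, elo_ranks):.3f}")
```

Because only the ordering matters, any monotonic relationship scores as a perfect correlation even when it is far from linear.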
The Sweet 16: Data Preparation
After performing this analysis on the Resumes data, we have a grasp on a couple of the interesting factors that influence how deep a run a team might make in the tournament. Our goal with this analysis is to see if we can predict the outcome of individual games, so in the next step we’ll join these historical rankings from the Resumes data with the larger dataset containing the actual results of each NCAA tournament game since 2008. Along with the seeds and Elo rankings, we’re also going to pull a couple of other potentially valuable team metrics from Resumes: “B Power” and “WAB Rank.” We’ll dig into these more in our next article.
The details of this join are available in the downloadable Dataiku project, but at a high level, we’re joining the datasets, using the team name and tournament year as the join keys.
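Conceptually, the join is a lookup keyed on the pair (tournament year, team name), applied once for each side of a game. The sketch below shows that idea in plain Python rather than as a Dataiku Join recipe. The field names (`team_1`, `score_1`, `elo_rank`, and so on) and the one-row-per-game layout are assumptions for illustration, not the actual Kaggle column names.

```python
# Hypothetical records; field names are illustrative, not the Kaggle columns.
resumes = [
    {"year": 2023, "team": "Team A", "seed": 1, "elo_rank": 3},
    {"year": 2023, "team": "Team B", "seed": 16, "elo_rank": 210},
]
matchups = [
    {"year": 2023, "team_1": "Team A", "score_1": 80, "team_2": "Team B", "score_2": 60},
    {"year": 2023, "team_1": "Team A", "score_1": 70, "team_2": "Team Z", "score_2": 65},
]


def join_matchups_with_resumes(matchups, resumes, metrics=("seed", "elo_rank")):
    """Left-join team metrics onto each game using (year, team) as the key."""
    lookup = {(row["year"], row["team"]): row for row in resumes}
    joined = []
    for game in matchups:
        row = dict(game)
        for side in ("1", "2"):
            resume = lookup.get((game["year"], game[f"team_{side}"]))
            for metric in metrics:
                # Unmatched teams keep the game row but get None for the metric.
                row[f"{metric}_{side}"] = resume[metric] if resume else None
        joined.append(row)
    return joined


joined = join_matchups_with_resumes(matchups, resumes)
for row in joined:
    print(row)
```

Keeping unmatched games with empty metrics (a left join) makes it easy to spot team-name mismatches between the two files, a common snag when joining on names.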
The result of this join gives us one record per tournament game since 2008, including the score of the game along with each team’s seed and its Elo, WAB, and B Power rankings.
Creating the Features
This snapshot of our data is interesting, but to better understand and describe any given matchup, we may want to generate some more specific features involving the difference between the teams’ seeds, WABs, and Elos. Specifically, we’re going to utilize a formula in Dataiku that subtracts the away team’s value from the home team’s value to give us the differential between the two teams. For example, if the home team is a 1 seed and the away team is a 16, the differential would be -15, indicating a large discrepancy between the seeds in favor of the home team.
Also using a formula, we can create a binary indicator feature that flags whether the home team (team 1) won each game.
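The two formula steps in this section can be sketched in plain Python as follows. This is an illustration rather than the Dataiku formula syntax, and the field names (`seed_1`, `elo_rank_2`, `team_1_won`, and so on) are hypothetical. Team 1 plays the role of the home team.

```python
def add_matchup_features(game, metrics=("seed", "elo_rank")):
    """Add home-minus-away differentials and a 0/1 flag for a team 1 win."""
    features = dict(game)
    for metric in metrics:
        # Negative values favor team 1, since lower seeds and ranks are better.
        features[f"{metric}_diff"] = game[f"{metric}_1"] - game[f"{metric}_2"]
    features["team_1_won"] = 1 if game["score_1"] > game["score_2"] else 0
    return features


# Hypothetical game: a 1 seed (team 1) beats a 16 seed (team 2).
game = {
    "seed_1": 1, "seed_2": 16,
    "elo_rank_1": 3, "elo_rank_2": 210,
    "score_1": 80, "score_2": 60,
}
features = add_matchup_features(game)
print(features["seed_diff"], features["elo_rank_diff"], features["team_1_won"])
```

Expressing each matchup as a differential folds two team-level columns into one game-level feature, which is what lets us correlate it directly with the win flag.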
Now, with our data joined and prepared, we can check whether the relationships among seed, Elo, and wins still hold in individual games. Performing a Spearman correlation, we observe a significant negative correlation between each differential (seed and Elo, as well as WAB and B Power, which we’ll dive into more in the next article) and whether the home team won, just as we might have expected.
From Data to Developing a Predictive Model
With our data prepared and key insights uncovered, we’re ready to take the next step—building a predictive model. In our next article, we’ll use the features and correlations identified here to train a model capable of forecasting individual game outcomes. But we’re not stopping there. Once our model is built, we’ll call in Generative AI models to help us fine-tune our predictions and make a model capable of predicting the games of the 2025 tournament.
Stay tuned as we put data-driven bracketology to the test and see if analytics can give us the ultimate edge in predicting March Madness!
