Using SQL to analyze PGA Tour data

Image source: Gregory Shamus

The Dataset

The data in this case study is sourced from the PGA Tour website and has been converted into a database which you can open here. The data is from the 2020 tour season and contains individual player data on 199 of golf’s best players.

The database is two data tables which are described below:

Drives Table

The first data table is the drives data table which contains data on individual player driving data. The table has three columns:

  • Player: the player’s name
  • Average_Drive: the golfer’s average driving distance in 2020, measured in yards
  • Fairway_Percentage: the percent of a player’s drives that landed in the fairway

The data table is 199 rows representing 199 different players who played on the PGA Tour in 2020. As seen in the screenshot of the table above, the table is sorted from longest to shortest average drive.

Strokes Gained Table

The second data table is the strokes gained table. Strokes gained is an advanced metric that measures a player’s performance based on each part of their game: driving, approach shots, and putting. To learn more about strokes gained, read this article on the PGA Tour website.

The table above is a screenshot of the first five rows of the data table. The table contains five rows: the player’s name, the number of rounds the golfer played in 2020, strokes gained from driving, strokes gained from approach shots, and strokes gained from putting.

This table has 193 rows representing the number of players who met the minimum threshold to be considered in this table.

Question 1: Which player had the longest average drive on the PGA Tour in 2020?

-
This question requires us to use the drives data table in the database. The drives table contains the average_drive column which calculates the player’s average driving distance.

To find which player had the longest average drive, we need to pull all columns from the data table and sort it by the average drive column from largest to smallest as seen in the code here:

SELECT * FROM drives
ORDER BY Average_Drive DESC

Answer: If we run the query above, we see the full drives table displayed that is sorted from longest to shortest average drive. The first row of the table is Bryson DeChambeau who had the longest average drive at 322.1 yards per drive, barely beating out Cameron Champ by .1 yards per drive.

Question 2: Which player had the longest average drive of all players who hit more than 60% of fairways?

-
We simply need to add a WHERE statement to our query from question one that filters the Fairway_Percentage column to show players who hit 60% of their fairways.

SELECT * FROM drives
WHERE Fairway_Percentage > 60
ORDER BY Average_Drive DESC

The first line of the code above displays all columns of the drives data table. The second line is our WHERE statement, which acts as a filter to the data table. Lastly, we then sort the data using the ORDER BY statement the same we did in the first question.

Running this query will give us our results. Tommy Fleetwood had the longest average drive of all players who hit 60% of their fairways with an average drive of 309.6 yards while hitting 61.3% of fairways.

Tommy Fleetwood

Question 3: What is the average driving distance of all players on tour?

-
This question means we need to calculate of the average drive column from the drives data table. We can use the AVG command to calculate the average of the Average_Drive column.

SELECT AVG(Average_Drive) FROM drives

Running the query above will provide us the average player driving distance which is 297.18.

Other calculation commands similar to AVG include SUM, which totals the values in a column, and COUNT, used to count how many rows are in the column.

Question 4: Display a list of all players who played more than 60 rounds of golf in 2020. Display only the player’s name and number of rounds played.

-
The first part of this question is to display only the player’s name and number of rounds by writing those two columns in the SELECT statement. The second part is to filter the number of rounds for all players playing more than 60 rounds.

SELECT Player, Rounds FROM strokesgained
WHERE rounds > 60

The key thing to learn from this question is we can choose which columns we want to display. The asterisk in the SELECT statement allows us to choose all columns or we can remove the asterisk and write the exact names of the columns we want to display.

Jordan Spieth

Question 5: Combine the two data tables by matching the player’s names but only display the player’s name, the number of rounds, average driving distance, and fairway percentage columns.

-
To combine data tables in SQL we use JOIN commands. This question requires the INNER JOIN command to combine these two data tables.

INNER JOIN requires a common column to combine the data tables. In this case, the player’s name is the common column.

SELECT strokesgained.player, strokesgained.rounds, drives.average_drive, drives.fairway_percentage FROM strokesgained
INNER JOIN drives
ON strokesgained.player = drives.player

The statement above combines the two data tables and only displays the four columns we are looking for.

Dustin Johnson

Challenge Questions

These questions are designed to be more advanced than the questions above and questions that require second order thinking.

Challenge questions are difficult questions you may see in data analyst interviews or on the job. New SQL learners will likely need to Google the answer to these questions — which is completely fine! Use whatever resources you need to find the answers to the questions.

Challenge 1: Write a SQL query that displays the number of yards on average that Bryson DeChambeau outdrives the average PGA tour player. Name the column “Difference”.

-
We will put the answer to this question at the bottom to give you a chance to solve this question.

The answer will involve using a subquery, which is just using multiple queries in one query.

SELECT 
(SELECT Average_Drive FROM drives WHERE Player = 'Bryson DeChambeau') -
(SELECT AVG(Average_Drive) FROM drives) AS "Difference"

--

--