Using SQL to analyze the top 1000 movies of all time

The Database

The data we will use is one data table that contains a list of the top 1000 rated movies of all time. The data contains 22 columns of data about each movie like the movie’s title, year released, duration, actors, budget, and review data.

View the data table in the form of a Google Sheet in a new tab here. Notice how the data table has 1,000 rows and 22 columns of data. This is a simple database because it only has one data table — remember that relational databases can have hundreds of tables.

Workspace

To analyze the database using SQL, we will be using DB Fiddle to simulate a MySQL database. Open the link to DB Fiddle here.

If you have not used DB Fiddle before, it is fairly intuitive but you may need to spend a few minutes on a tutorial.

Question 1: Display the movies data table.

To pull data from a database we use the SELECT command in SQL. To use this command we need to indicate which data table we are pulling from as well as which columns we want to retrieve.

In this case there is only one data table, the movies data table, and we want to pull all data from it, so we write the following SQL command in the “Query SQL” section of our DB Fiddle simulator.

SELECT * FROM movies

We can run this SQL query by clicking the ‘run’ button in the simulator. The results of the code will display at the bottom of the page. Here is a screenshot of the first row of the data table:

The first row of the movies data table

The asterisk in the SQL code above indicates that we want to select all columns from the data table. What if we want to just pull certain columns from the table?

Question 2: Display only the movie title, year, and duration from the movies table.

This code will be similar to the code above. Start by copying over the previous query and remove the asterisk. Next, add in the exact column names we are looking for: the movie’s title is the original_title column, followed by the year and duration columns.

SELECT original_title, year, duration FROM movies

When you run this code, only these three columns will be displayed.

Question 3: Sort the data table to display the highest rated movies from highest to lowest rating.

When writing this query, display only the movie’s title, year, and rating (the metascore column). To sort the data, we use the ORDER BY command.

SELECT original_title, year, metascore FROM movies
ORDER BY metascore DESC

Running this query displays the list of movies from highest to lowest rating. As we see in the data the top rated movies include “The Wizard of Oz”, “Citizen Kane”, and “The Godfather”.

The top rated movies of all time

Question 4: Write a query that only displays Harry Potter movies.

Similar to previous queries, display only the movie title, year, and metascore. Because we are filtering data, we are using the WHERE command.

SELECT original_title, year, metascore FROM movies
WHERE original_title LIKE 'Harry Potter%'

Because every Harry Potter movie starts with ‘Harry Potter’, like “Harry Potter and the Sorcerer’s Stone”, we can filter the database for all movies that start with the phrase ‘Harry Potter’.

Question 5: What is the highest rated Harry Potter movie?

Now that we filtered the database for Harry Potter movies, we can now sort the data using the ORDER BY command we used from question 3.

SELECT original_title, year, metascore FROM movies
WHERE original_title LIKE 'Harry Potter%'
ORDER BY metascore DESC

Running this query shows the highest rated Harry Potter movie is Harry Potter and Deathly Hallows Part 2 with an 85 overall rating.

Want to learn SQL using real-world applications?

Check out our computer science classes that are tailored to business professionals and students.

--

--