Shaw Scraper

A web scraper built to collect movie showtime and seat-purchase data from Shaw Theatres’ website in order to understand movie-goers’ behavioural patterns. The data was used to build visualisations which can be found on the PopcornData website. More details on how we obtained and cleaned the data can be found in this Medium article.

Data Collected

Raw Data

The complete raw data collected can be found here.

Cleaned Data

The processed data can be found here.

Built With

Getting Started

The scraper was built to run on Heroku; the following instructions describe how to deploy it there.

Prerequisites

Installation

  1. Clone the repo and navigate to the cloned folder
  git clone https://github.com/PopcornData/shaw-scraper.git
  2. Open your Heroku CLI and log in to Heroku
  heroku login
  3. Create a new project on Heroku
  heroku create <project-name>
  4. Add the remote
  heroku git:remote -a <project-name>
  5. Add the buildpacks necessary for Selenium ChromeDriver
  heroku buildpacks:add --index 1 https://github.com/heroku/heroku-buildpack-python.git

  heroku buildpacks:add --index 2 https://github.com/heroku/heroku-buildpack-chromedriver

  heroku buildpacks:add --index 3 https://github.com/heroku/heroku-buildpack-google-chrome
  6. Add the PATH variables to the Heroku configuration
  heroku config:set GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome

  heroku config:set CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver

  heroku config:set MONGODB_URL=<your-MongoDB-connection-string>
  7. Deploy to Heroku (make sure you are in the cloned folder before deploying)
  git push heroku master
  8. Run the following command to start the scraper
  heroku ps:scale clock=1
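The three config vars set above are read by the scraper at runtime. A minimal sketch of how they might be consumed (the helper name and the local fallback paths are illustrative, not the project's actual code):

```python
import os

def load_config():
    """Read the Heroku config vars set during installation.

    Falls back to typical local paths so the scraper could also be
    run outside Heroku (the fallback values are illustrative).
    """
    return {
        "chrome_binary": os.environ.get("GOOGLE_CHROME_BIN", "/usr/bin/google-chrome"),
        "chromedriver_path": os.environ.get("CHROMEDRIVER_PATH", "/usr/local/bin/chromedriver"),
        "mongodb_url": os.environ.get("MONGODB_URL", "mongodb://localhost:27017"),
    }
```

Selenium would then be pointed at `chrome_binary` and `chromedriver_path` when constructing the headless Chrome driver, and `mongodb_url` passed to the MongoDB client.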

Usage

The scraper has two functions that run separately:

  1. get_movie_data() - Scrapes the movie details from all the theatres for the given day and stores the JSON data in the database. The data has the following format:
  {
   "theatre":"Nex",
   "hall":"nex Hall 5",
   "movie":"Jumanji: The Next Level",
   "date":"18 Jan 2020",
   "time":"1:00 PM+",
   "session_code":"P00000000000000000200104"
  }
  2. get_seat_data() - Scrapes the seat details for each movie session, including which seats were bought and when each seat was bought. It scrapes data from the previous day so that all the seat data (ticket sales) is up to date. It should be run only after get_movie_data(), as it updates the JSON in the database by adding the seat data to it. The updated data has the following format:
   {
       "theatre":"Nex",
       "hall":"nex Hall 5",
       "movie":"Jumanji: The Next Level",
       "date":"18 Jan 2020",
       "time":"1:00 PM+",
       "session_code":"P00000000000000000200104"
       "seats":[
           {   
             "seat_status":"AV",
             "last_update_time":"2020-01-20 14:34:53.704117",
             "seat_buy_time":"1900-01-01T00:00:00",
             "seat_number":"I15",
             "seat_sold_by":""
           },
           ...,
           {  
             "seat_status":"SO",
             "last_update_time":"2020-01-20 14:34:53.705116",
             "seat_buy_time":"2020-01-18T13:12:34.193",
             "seat_number":"F6",
             "seat_sold_by":""
           }
        ]
   }
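The seat records above lend themselves to simple aggregation. A hedged sketch (the field meanings are inferred from the sample: "SO" presumably means sold and "AV" available; the helper names are ours, not part of the scraper):

```python
from datetime import datetime

def parse_session_start(doc):
    """Combine the document's date and time fields into a datetime.

    The trailing '+' seen in times like '1:00 PM+' is stripped first.
    """
    raw = f"{doc['date']} {doc['time'].rstrip('+')}"
    return datetime.strptime(raw, "%d %b %Y %I:%M %p")

def occupancy(doc):
    """Fraction of seats whose status marks them as sold ('SO')."""
    seats = doc.get("seats", [])
    if not seats:
        return 0.0
    sold = sum(1 for s in seats if s["seat_status"] == "SO")
    return sold / len(seats)

doc = {
    "date": "18 Jan 2020",
    "time": "1:00 PM+",
    "seats": [
        {"seat_status": "AV", "seat_number": "I15"},
        {"seat_status": "SO", "seat_number": "F6"},
    ],
}
print(parse_session_start(doc))  # 2020-01-18 13:00:00
print(occupancy(doc))            # 0.5
```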

A full sample updated document in the database can be viewed here.

The functions are scheduled to run daily at the times specified in clock.py. The timings and frequency of the scraper can be changed by editing the clock.py file.
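The "next run" logic behind a daily job like the ones in clock.py can be sketched with the standard library alone (the actual scheduler used by the project may differ; the 09:00 run time below is purely illustrative):

```python
from datetime import datetime, timedelta

def next_daily_run(now, hour, minute=0):
    """Return the next datetime at hour:minute strictly after `now`.

    A daily scheduler sleeps until this instant, runs the job,
    then recomputes the next run.
    """
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot already passed
    return candidate

# E.g. if get_movie_data() were scheduled for 09:00 daily:
print(next_daily_run(datetime(2020, 1, 18, 13, 0), hour=9))  # 2020-01-19 09:00:00
```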

License

Distributed under the MIT License. See LICENSE for more information.

Team

Disclaimer

This scraper was made as a project to analyse cinema seat-buying patterns. We are in no way affiliated with Shaw Theatres and are not responsible for the accuracy of the data scraped using this scraper. The scraper was developed to scrape data from the website in January 2020 and was functional as of June 2020. It may no longer work as expected, as the structure of the website may have changed since then.