Data Engineering: collecting data and inserting into a SQL database


For the project located here, Crypto Data Engineering, I decided to gather information on the cryptocurrency markets so that I could run an analysis to determine whether I could build a stable quant model. The data engineering portion of a project typically comes after determining a hypothesis, setting project plans, and investigating data sources. For this project, I want to see if momentum is a factor in cryptocurrency markets, and I want to know if there's an edge to be gained by incorporating an activity measure (Github commits on a cryptocurrency's public repo - see the Bitcoin Github repository as an example). To do this, I need a list of cryptocurrencies with their trading tickers, pricing data, and data on each of the crypto repositories.

Data Sources

After some research, I found three data sources to achieve these goals:

  1. List of tradable currencies: Bittrex currencies
  2. Pricing: Alpha Vantage pricing
  3. Github repo information: i.e. Bitcoin's repo information on Coingecko


Obtaining the data

The steps below come from the Python ETL Process script. The first thing the script does is pull the list of currencies by calling the Bittrex API directly, which is easy to do in Python with the urllib2 library. It then pulls pricing information for each ticker by iterating through the list and calling the Alpha Vantage API the same way.

Finally, we need the metadata on each Github repo, so we iterate through the list of cryptocurrencies and concatenate each cryptocurrency's name onto the Coingecko URL to dynamically find the repo location. Once we have the repo location, we can pull any data on the repo from the Github API. As we iterate through the list, we call two functions: getGitRepos on row 381 and getRepoUpdateDates on row 394 of the script. The code for each function is below.

To find the repo location, we have to scrape the Coingecko website. Scraping a site entails loading the HTML into Python and finding the data we're looking for by traversing the DOM. Luckily, Beautiful Soup, a Python library, makes this easy for us.
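The second function below relies on a helper, returnApiJson, that isn't shown in this excerpt. A minimal sketch of what such a helper might look like, assuming it simply fetches a URL and returns the parsed JSON body (written here with Python 3's urllib.request; the original script uses urllib2):

```python
import json
import urllib.error
import urllib.request

# Hypothetical sketch of the returnApiJson helper used by the script:
# fetch a URL, parse the body as JSON, and return None on failure.
def returnApiJson(url):
    try:
        request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.loads(response.read().decode('utf-8'))
    except (urllib.error.URLError, ValueError):
        return None
```

Returning None on failure keeps the calling loop simple: an empty or missing response ends the pagination.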

import urllib2
from bs4 import BeautifulSoup

# Find Github repo information for a market
# longName {string} - the crypto market long name
# shortName {string} - the crypto market short name
# return - an object with all repos for a crypto asset
def getGitRepos(longName, shortName):
    data = []
    site = '' + longName + '/developer#panel'
    header = {
        'User-Agent': 'Mozilla/5.0'
    }
    request = urllib2.Request(site, headers=header)
    page = urllib2.urlopen(request)
    soup = BeautifulSoup(page, 'html.parser')

    # Keep every anchor whose href points at Github
    for node in soup.find_all('a', href=True, text=True):
        if 'github' in node['href']:
            data.append({
                'longName': longName,
                'shortName': shortName,
                'repoName': node.text,
                'repoLocation': node['href']
            })

    return data
import datetime

# Parse commit information in Github for a repo
# repoName {string} - the name of the repo on Github
# marketName {string} - the name of the crypto market
# githubCredentials {dict} - the key and id used to call data from the github API
# return - an object with dates and authors for all repo commits
def getRepoUpdateDates(repoName, marketName, githubCredentials):
    meta = []
    iterate = True
    index = 1

    while iterate:
        repoApi = '' \
            + repoName \
            + '/commits?page=' + str(index) + '&per_page=100' \
            + '&client_id=' + githubCredentials['clientId'] \
            + '&client_secret=' + githubCredentials['clientSecret']
        data = returnApiJson(repoApi)
        if data:
            for commit in data:
                # Strip the ISO-8601 literals before parsing the timestamp
                date = commit['commit']['author']['date'].replace('T', '').replace('Z', '')
                date = datetime.datetime.strptime(date, '%Y-%m-%d%H:%M:%S')
                meta.append({
                    'author': commit['commit']['author']['email'],
                    'market': marketName,
                    'repoName': repoName,
                    'updated': date
                })
            index += 1
        else:
            # An empty page means we've paged past the last commit
            iterate = False
    return meta
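The date handling above strips the T and Z from Github's ISO-8601 timestamps before parsing. That step can be checked in isolation, and the same result can be had in one parse by putting the literals in the format string (the timestamp below is illustrative):

```python
import datetime

raw = '2021-03-14T09:26:53Z'  # illustrative Github-style timestamp

# The approach used above: strip the literals, then parse.
stripped = raw.replace('T', '').replace('Z', '')
parsed = datetime.datetime.strptime(stripped, '%Y-%m-%d%H:%M:%S')

# Equivalent one-step parse with T and Z embedded in the format string.
direct = datetime.datetime.strptime(raw, '%Y-%m-%dT%H:%M:%SZ')
```

Both forms produce the same datetime; the one-step version avoids the intermediate string.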

Loading the data

Once the data is in Pandas DataFrames, we can load it into SQL Server; the schema is located here. To do the load, we use the to_sql method in the Pandas library. This is a fairly simple way to get data into a SQL database, although not the most efficient, so it should only be used for small data loads. All you need to pass is the table name, the database engine, and what to do if the table already exists. (Pandas will even create the table for you when one does not exist, if you specify that behavior, so you could bypass the schema creation for a quick data load - obviously not a best-practice solution.)

allBittrexBtcMarkets.to_sql('cryptoAllBittrexMarketsExt', env.dbEngine, if_exists = 'append', index = False)
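The same call can be exercised end to end against an in-memory SQLite database, which to_sql also accepts in place of a SQLAlchemy engine. The frame and column names below are made up for illustration:

```python
import sqlite3
import pandas as pd

# Illustrative frame standing in for allBittrexBtcMarkets.
markets = pd.DataFrame({
    'MarketCurrency': ['BTC', 'ETH'],
    'MarketCurrencyLong': ['Bitcoin', 'Ethereum'],
})

conn = sqlite3.connect(':memory:')

# if_exists='append' inserts the rows, creating the table first if it
# does not already exist.
markets.to_sql('cryptoAllBittrexMarketsExt', conn, if_exists='append', index=False)

rows = pd.read_sql('SELECT * FROM cryptoAllBittrexMarketsExt', conn)
```

Running the to_sql call a second time would append the same two rows again, which is why the real load should only run once per extract.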