Monday, May 29, 2023

The Moz Links API: An Introduction

What exactly IS an API? They’re those things that you copy and paste long strange codes into Screaming Frog for links data on a Site Crawl, right?

I’m here to tell you there’s so much more to them than that – if you’re willing to take just a few little steps. But first, some basics.

What’s an API?

API stands for “application programming interface”, and it’s just the way of… using a thing. Everything has an API. The web is a giant API that takes URLs as input and returns pages.

But special data services like the Moz Links API have their own set of rules. These rules vary from service to service and can be a major stumbling block for people taking the next step.

When Screaming Frog gives you the extra links columns in a crawl, it’s using the Moz Links API, but you can have this capability anywhere. For example, all that tedious manual stuff you do in spreadsheet environments can be automated from data-pull to formatting and emailing a report.

If you take this next step, you can be more efficient than your competitors, designing and delivering your own SEO services instead of relying upon, paying for, and being limited by the next proprietary product integration.

GET vs. POST

Most APIs you’ll encounter use the same data transport mechanism as the web. That means there’s a URL involved just like a website. Don’t get scared! It’s easier than you think. In many ways, using an API is just like using a website.

As with loading web pages, the request may be in one of two places: the URL itself, or in the body of the request. The URL is called the “endpoint” and the often invisibly submitted extra part of the request is called the “payload” or “data”. When the data is in the URL, it’s called a “query string” and indicates the “GET” method is used. You see this all the time when you search:

https://www.google.com/search?q=moz+links+api <-- GET method 

When the data of the request is hidden, it’s called a “POST” request. You see this when you submit a form on the web and the submitted data does not show on the URL. When you hit the back button after such a POST, browsers usually warn you against double-submits. The reason the POST method is often used is that you can fit a lot more in the request using the POST method than the GET method. URLs would get very long otherwise. The Moz Links API uses the POST method.

Making requests

A web browser is what traditionally makes requests of websites for web pages. The browser is a type of software known as a client. Clients are what make requests of services. More than just browsers can make requests. The ability to make client web requests is often built into programming languages like Python, or can be broken out as a standalone tool. The most popular tools for making requests outside a browser are curl and wget.

We are discussing Python here. Python has a built-in library called URLLIB, but it’s designed to handle so many different types of requests that it’s a bit of a pain to use. There are other libraries that are more specialized for making requests of APIs. The most popular for Python is called requests. It’s so popular that it’s used for almost every Python API tutorial you’ll find on the web. So I will use it too. This is what “hitting” the Moz Links API looks like:

response = requests.post(endpoint, data=json_string, auth=auth_tuple)

Given that everything was set up correctly (more on that soon), this will produce the following output:

{'next_token': 'JYkQVg4s9ak8iRBWDiz1qTyguYswnj035nqrQ1oIbW96IGJsb2dZgGzDeAM7Rw==',
 'results': [{'anchor_text': 'moz',
              'external_pages': 7162,
              'external_root_domains': 2026}]}

This is JSON data. It's contained within the response object that was returned from the API. It’s not on the drive or in a file. It’s in memory. So long as it’s in memory, you can do stuff with it (often just saving it to a file).

If you wanted to grab a piece of data within such a response, you could refer to it like this:

response['results'][0]['external_pages']

This says: “Give me the first item in the results list, and then give me the external_pages value from that item.” The result would be 7162.

NOTE: If you’re actually following along executing code, the above line won’t work alone. There’s a certain amount of setup we’ll do shortly, including installing the requests library and setting up a few variables. But this is the basic idea.

JSON

JSON stands for JavaScript Object Notation. It’s a way of representing data in a way that’s easy for humans to read and write. It’s also easy for computers to read and write. It’s a very common data format for APIs that has somewhat taken over the world since the older ways were too difficult for most people to use. Some people might call this part of the “restful” API movement, but the much more difficult XML format is also considered “restful” and everyone seems to have their own interpretation. Consequently, I find it best to just focus on JSON and how it gets in and out of Python.

Python dictionaries

I lied to you. I said that the data structure you were looking at above was JSON. Technically it’s really a Python dictionary or dict datatype object. It’s a special kind of object in Python that’s designed to hold key/value pairs. The keys are strings and the values can be any type of object. The keys are like the column names in a spreadsheet. The values are like the cells in the spreadsheet. In this way, you can think of a Python dict as a JSON object. For example here’s creating a dict in Python:

my_dict = {
    "name": "Mike",
    "age": 52,
    "city": "New York"
}

And here is the equivalent in JavaScript:

var my_json = {
    "name": "Mike",
    "age": 52,
    "city": "New York"
}

Pretty much the same thing, right? Look closely. Key-names and string values get double-quotes. Numbers don’t. These rules apply consistently between JSON and Python dicts. So as you might imagine, it’s easy for JSON data to flow in and out of Python. This is a great gift that has made modern API-work highly accessible to the beginner through a tool that has revolutionized the field of data science and is making inroads into marketing, Jupyter Notebooks.

Flattening data

But beware! As data flows between systems, it’s not uncommon for the data to subtly change. For example, the JSON data above might be converted to a string. Strings might look exactly like JSON, but they’re not. They’re just a bunch of characters. Sometimes you’ll hear it called “serializing”, or “flattening”. It’s a subtle point, but worth understanding as it will help with one of the largest stumbling blocks with the Moz Links (and most JSON) APIs.

Objects have APIs

Actual JSON or dict objects have their own little APIs for accessing the data inside of them. The ability to use these JSON and dict APIs goes away when the data is flattened into a string, but it will travel between systems more easily, and when it arrives at the other end, it will be “deserialized” and the API will come back on the other system.

Data flowing between systems

This is the concept of portable, interoperable data. Back when it was called Electronic Data Interchange (or EDI), it was a very big deal. Then along came the web and then XML and then JSON and now it’s just a normal part of doing business.

If you’re in Python and you want to convert a dict to a flattened JSON string, you do the following:

import json

my_dict = {
    "name": "Mike",
    "age": 52,
    "city": "New York"
}

json_string = json.dumps(my_dict)

…which would produce the following output:

'{"name": "Mike", "age": 52, "city": "New York"}'

This looks almost the same as the original dict, but if you look closely you can see that single-quotes are used around the entire thing. Another obvious difference is that you can line-wrap real structured data for readability without any ill effect. You can't do it so easily with strings. That’s why it’s presented all on one line in the above snippet.

Such stringifying processes are done when passing data between different systems because they are not always compatible. Normal text strings on the other hand are compatible with almost everything and can be passed on web-requests with ease. Such flattened strings of JSON data are frequently referred to as the request.

Anatomy of a request

Again, here’s the example request we made above:

response = requests.post(endpoint, data=json_string, auth=auth_tuple)

Now that you understand what the variable name json_string is telling you about its contents, you shouldn’t be surprised to see this is how we populate that variable:

 data_dict = {
    "target": "moz.com/blog",
    "scope": "page",
    "limit": 1
}

json_string = json.dumps(data_dict)

…and the contents of json_string looks like this:

'{"target": "moz.com/blog", "scope": "page", "limit": 1}'

This is one of my key discoveries in learning the Moz Links API. This is in common with countless other APIs out there but trips me up every time because it’s so much more convenient to work with structured dicts than flattened strings. However, most APIs expect the data to be a string for portability between systems, so we have to convert it at the last moment before the actual API-call occurs.

Pythonic loads and dumps

Now you may be wondering in that above example, what a dump is doing in the middle of the code. The json.dumps() function is called a “dumper” because it takes a Python object and dumps it into a string. The json.loads() function is called a “loader” because it takes a string and loads it into a Python object.

The reason for what appear to be singular and plural options are actually binary and string options. If your data is binary, you use json.load() and json.dump(). If your data is a string, you use json.loads() and json.dumps(). The s stands for string. Leaving the s off means binary.

Don’t let anybody tell you Python is perfect. It’s just that its rough edges are not excessively objectionable.

Assignment vs. equality

For those of you completely new to Python or programming in general, what we’re doing when we hit the API is called an assignment. The result of requests.post() is being assigned to the variable named response.

response = requests.post(endpoint, data=json_string, auth=auth_tuple)

We are using the = sign to assign the value of the right side of the equation to the variable on the left side of the equation. The variable response is now a reference to the object that was returned from the API. Assignment is different from equality. The == sign is used for equality.

# This is assignment:
a = 1  # a is now equal to 1

# This is equality:
a == 1  # True, but relies that the above line has been executed

The POST method

response = requests.post(endpoint, data=json_string, auth=auth_tuple)

The requests library has a function called post() that takes 3 arguments. The first argument is the URL of the endpoint. The second argument is the data to send to the endpoint. The third argument is the authentication information to send to the endpoint.

Keyword parameters and their arguments

You may notice that some of the arguments to the post() function have names. Names are set equal to values using the = sign. Here’s how Python functions get defined. The first argument is positional both because it comes first and also because there’s no keyword. Keyworded arguments come after position-dependent arguments. Trust me, it all makes sense after a while. We all start to think like Guido van Rossum.

def arbitrary_function(argument1, name=argument2):
    # do stuff

The name in the above example is called a “keyword” and the values that come in on those locations are called “arguments”. Now arguments are assigned to variable names right in the function definition, so you can refer to either argument1 or argument2 anywhere inside this function. If you’d like to learn more about the rules of Python functions, you can read about them here.

Setting up the request

Okay, so let’s let you do everything necessary for that success assured moment. We’ve been showing the basic request:

response = requests.post(endpoint, data=json_string, auth=auth_tuple)

…but we haven’t shown everything that goes into it. Let’s do that now. If you’re following along and don’t have the requests library installed, you can do so with the following command from the same terminal environment from which you run Python:

pip install requests

Often times Jupyter will have the requests library installed already, but in case it doesn’t, you can install it with the following command from inside a Notebook cell:

!pip install requests

And now we can put it all together. There’s only a few things here that are new. The most important is how we’re taking 2 different variables and combining them into a single variable called AUTH_TUPLE. You will have to get your own ACCESSID and SECRETKEY from the Moz.com website.

The API expects these two values to be passed as a Python data structure called a tuple. A tuple is a list of values that don’t change. I find it interesting that requests.post() expects flattened strings for the data parameter, but expects a tuple for the auth parameter. I suppose it makes sense, but these are the subtle things to understand when working with APIs.

Here’s the full code:

import json
import pprint
import requests

# Set Constants
ACCESSID = "mozscape-1234567890"  # Replace with your access ID
SECRETKEY = "1234567890abcdef1234567890abcdef"  # Replace with your secret key
AUTH_TUPLE = (ACCESSID, SECRETKEY)

# Set Variables
endpoint = "https://lsapi.seomoz.com/v2/anchor_text"
data_dict = {"target": "moz.com/blog", "scope": "page", "limit": 1}
json_string = json.dumps(data_dict)

# Make the Request
response = requests.post(endpoint, data=json_string, auth=AUTH_TUPLE)

# Print the Response
pprint(response.json())

…which outputs:

{'next_token': 'JYkQVg4s9ak8iRBWDiz1qTyguYswnj035nqrQ1oIbW96IGJsb2dZgGzDeAM7Rw==',
 'results': [{'anchor_text': 'moz',
              'external_pages': 7162,
              'external_root_domains': 2026}]}

Using all upper case for the AUTH_TUPLE variable is a convention many use in Python to indicate that the variable is a constant. It’s not a requirement, but it’s a good idea to follow conventions when you can.

You may notice that I didn’t use all uppercase for the endpoint variable. That’s because the anchor_text endpoint is not a constant. There are a number of different endpoints that can take its place depending on what sort of lookup we wanted to do. The choices are:

  1. anchor_text

  2. final_redirect

  3. global_top_pages

  4. global_top_root_domains

  5. index_metadata

  6. link_intersect

  7. link_status

  8. linking_root_domains

  9. links

  10. top_pages

  11. url_metrics

  12. usage_data

And that leads into the Jupyter Notebook that I prepared on this topic located here on Github. With this Notebook you can extend the example I gave here to any of the 12 available endpoints to create a variety of useful deliverables, which will be the subject of articles to follow.

No comments:

Post a Comment