We are going to use some existing code to work with the Twitter API. Once you get the hang of what you are doing and what you can modify, you can use these scripts to do your own searches from Twitter. These scripts are written in Python, but you don’t have to be a Python coder to use them. You just need to be able to follow along and make modifications.
You will need to install the oauth2 library. If you are doing this on your own computer, sudo will ask for your login password when you run this command.
Go to the Terminal:
$ sudo easy_install oauth2
For all files, you will need some identifying information from Twitter. Go to developer.twitter.com and apply for a developer account. You will sign in with your Twitter account. You will have to wait for Twitter to approve you. Once approved, you can go to your Dashboard, and choose Apps under your account name. Use the Create New App Button.
When you create a new app, you will get the following information under Apps, Keys and Tokens.
- consumer_key
- consumer_secret
- token_key
- token_secret
Basic Search
The first script lets you work with the Twitter API to pull the 100 most recent results from Twitter for a search term. This is Twitter’s limit for a single basic API call. Save the code in a file named tweet_basic.py.
You can modify searchterm and searchTermShort for your own search. Use %23 to represent a hashtag. You will see how the script concatenates these onto the Twitter API URL to create the API query.
import oauth2 as oauth # oauth authorization needed for twitter API
import json # converting data into json object
from pprint import pprint # pretty print
# construct search url
# baseurl is the twitter api
# use %23 to represent hashtag symbol
# count is 100, twitter default without using a loop to get more
# play with changing the search term
# I added searchTermShort to concatenate in filename. See below.
baseurl = "https://api.twitter.com/1.1/search/tweets.json"
searchterm = "%23sxsw"
searchTermShort = "sxsw"
count = "100"
# the url we are using includes baseurl plus ?q to open query plus search term, adding & and count
url = baseurl + '?q=' + searchterm + '&' + 'count=' + count
# my keys, need all four of them. Use your own keys here.
consumer_key = "USE YOUR CREDENTIALS"
consumer_secret = "USE YOUR CREDENTIALS"
token_key = "USE YOUR CREDENTIALS"
token_secret = "USE YOUR CREDENTIALS"
# set up oauth tokens
token = oauth.Token(token_key, token_secret)
consumer = oauth.Consumer(consumer_key, consumer_secret)
# create client and request data from url
client = oauth.Client(consumer, token)
header, contents = client.request(url, method="GET")
# write retrieved data to file
# concatenate searchTermShort in filename
filename = searchTermShort + '_tweets.json'
localfile = open(filename, 'w')
localfile.write(contents)
localfile.close()
# convert to json object
data = json.loads(contents)
# print meta data on search results
pprint(data['search_metadata'])
To run it, make sure you have the file on your computer and that you are in that file’s folder in the Terminal (cd to that folder). Make sure your credentials are in the script; you can open it in a text editor to check and add them. Then run it. Remember that the $ indicates the Terminal prompt, so you don’t type it.
$ python tweet_basic.py
You will then see a .json file in your folder. You can use a json to csv converter at http://konklone.io/json/ to convert your file to a csv. Then you can read it into Excel.
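If you want to peek at the results from Python before converting anything, a minimal sketch like the one below loops through the saved file and prints each tweet. It assumes the default search term, so the file is named sxsw_tweets.json; swap in your own filename.
import json
# load the file written by tweet_basic.py (sxsw_tweets.json is the default name used above)
localfile = open('sxsw_tweets.json', 'r')
data = json.load(localfile)
localfile.close()
# each tweet lives in the 'statuses' list
for tweet in data['statuses']:
    print((tweet['user']['screen_name'] + ': ' + tweet['text']).encode('utf8'))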
More Advanced Search
You can use the code below to get more results. This is a nice file because it prompts you for the search term and the number of results you want. It then writes one file for each batch of 100 tweets as it works through the loop. On each pass, the loop updates max_id so that the script reaches further back and pulls the next batch of 100 tweets matching your query.
Name this file tweet_mult_set.py
import oauth2 as oauth # oauth authorization needed for twitter API
import json, re # converting data into json object, using replace
from pprint import pprint # pretty print
# function to construct url for search query
# max_id is set to 0 if only two arguments are passed
def makeurl(searchterm, count, max_id=0):
    baseurl = "https://api.twitter.com/1.1/search/tweets.json"
    if max_id == 0:
        url = baseurl + '?q=' + searchterm + '&' \
            + 'count=' + count
    else:
        url = baseurl + '?q=' + searchterm + '&' \
            + 'max_id=' + str(max_id) + '&' \
            + 'count=' + count
    return url
#request input from the command line, must import re for the substitutions
my_raw_searchterm = raw_input(' * What are your search terms? ')
my_searchterm = re.sub(r'#','%23', my_raw_searchterm)
searchTermShort = re.sub(r'#', '', my_raw_searchterm)
searchTermShort = re.sub(r' ', '', searchTermShort)
# determine loop count
MAX_RESULTS_FROM_TWITTER = 100
desired_max_count = int(raw_input(' * How many tweets do you want? '))
url = makeurl(my_searchterm, str(MAX_RESULTS_FROM_TWITTER))
# my keys, need all four of them. Use your own keys here.
consumer_key = "USE YOUR CREDENTIALS"
consumer_secret = "USE YOUR CREDENTIALS"
token_key = "USE YOUR CREDENTIALS"
token_secret = "USE YOUR CREDENTIALS"
# set up oauth tokens
token = oauth.Token(token_key, token_secret)
consumer = oauth.Consumer(consumer_key, consumer_secret)
# create client and request data
client = oauth.Client(consumer, token)
loopcount = desired_max_count / MAX_RESULTS_FROM_TWITTER
for i in range(loopcount):
    # loop to pull results 100 at a time
    # fetch search results
    header, contents = client.request(url, method="GET")
    # convert results to json object
    data = json.loads(contents)
    # write retrieved data to a unique file based on i
    # will overwrite files of same name
    filename = searchTermShort + str(i) + '.json'
    localfile = open(filename, 'w')
    localfile.write(contents)
    localfile.close()
    # find number of search results
    results = len(data['statuses'])
    # find the id of the oldest tweet; subtract 1 so it is not fetched again in the next batch
    next_id = data['statuses'][results - 1]['id'] - 1
    # get date for oldest tweet and print
    oldest_tweet_date = data['statuses'][results - 1]['created_at']
    print(oldest_tweet_date)
    # construct search query to fetch tweets older than the current batch
    url = makeurl(my_searchterm, str(MAX_RESULTS_FROM_TWITTER), next_id)
Make sure your credentials are in the script. Run it.
$ python tweet_mult_set.py
Respond to the prompts. Look at your folder for the files.
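If you want a quick sanity check on how many tweets landed in each file, a small sketch like this counts the statuses in each one. It assumes your short search term was sxsw, so the files are sxsw0.json, sxsw1.json, and so on; substitute your own base name.
import json, os
# look for up to 100 numbered files (sxsw is an assumed base name)
for i in range(100):
    filename = 'sxsw' + str(i) + '.json'
    if not os.path.isfile(filename):
        break
    localfile = open(filename, 'r')
    data = json.load(localfile)
    localfile.close()
    print(filename + ': ' + str(len(data['statuses'])) + ' tweets')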
Working with the Files
I also have a script that lets you convert your json files to csv. Use the code below and name the file convert.py. When you run it, give it the base name of the files to convert. For example, if we have several files with the base name sxsw (sxsw0.json, sxsw1.json, sxsw2.json), we pass sxsw.json on the command line.
import json # converting data into json object
from pprint import pprint # pretty print
import string # extracting filename
import sys # command-line arguments
import os # os functions
if len(sys.argv) != 2:
    print("usage : ")
    print("\t python convert.py filename")
    sys.exit()
# include number of files you are converting
# this will work for up to 100 files. Change numfile if you have more
numfile = 100
#goes through each file starting with 0 in filename
for j in range(numfile):
    # grab base filename from command line
    filename = sys.argv[1]
    # construct output csv filename from json file
    # if json file is foo.json then csv file is foo.csv
    splitname = string.split(filename, '.')
    fields = len(splitname)
    ext = splitname[fields - 1]
    # adds j to csv name to create separate file for each
    name = splitname[0] + str(j)
    # checks to see if file exists before proceeding
    if os.path.isfile(name + '.json'):
        csvfilename = name + ".csv"
        csvfile = open(csvfilename, 'w')
        # open file and load data, uses the base name to open each json file, starting with 0
        localfile = open(name + '.json', 'r')
        data = json.load(localfile)
        # get number of tweets in file
        tweets = len(data['statuses'])
        for i in range(tweets):
            # text of the tweet
            tweet_text = data['statuses'][i]['text']
            # date tweet was sent
            date = data['statuses'][i]['created_at']
            # full name of user
            full_name = data['statuses'][i]['user']['name']
            # username (without the @)
            username = data['statuses'][i]['user']['screen_name']
            # combine and create CSV text
            csv_text = full_name + "," + username + "," + date + "," + "\"" + tweet_text + "\"" + "\n"
            # dump to file
            csvfile.write(csv_text.encode('utf8'))
        localfile.close()
        csvfile.close()
$ python convert.py sxsw.json
The above is an example that converts files with sxsw in the name. Then you can use a basic Terminal command to concatenate all these into one csv.
$ cat *.csv > sxsw_combine.csv
Name the resulting file whatever you want. You can open this now in Excel.
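If you are on a machine without the cat command (Windows, for example), a short Python sketch can do the same concatenation. The sxsw base name and the sxsw_combine.csv output name are just examples here.
import glob
# gather the per-batch csv files first so the combined file is not swept up in the list
csvfiles = sorted(glob.glob('sxsw*.csv'))
outfile = open('sxsw_combine.csv', 'w')
for csvname in csvfiles:
    if csvname == 'sxsw_combine.csv':
        continue  # skip an old combined file left over from an earlier run
    infile = open(csvname, 'r')
    outfile.write(infile.read())
    infile.close()
outfile.close()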
Other Tools
Tag Crowd
The Tag Crowd site makes quick word visualizations. It’s easy to use and flexible. Go to tagcrowd.com and insert your text. Then run the visualization. You may have to exclude common words before the visualization is meaningful. I like to show frequencies and display 100 words maximum. Play with the settings to get the right visualization for your topic.
Word Frequency
One of the things you might want to do is run a word frequency script to determine which words are used the most in passages of text. This is similar to what sites like TagCrowd.com do when they want to visualize terms in a word cloud. But you can use a Python script to get word counts. You might want to use this in some manner in your analysis, like using the data in a chart on your site.
Copy the text you want to analyze and put it in a txt file. Run this script in the Terminal with Python. It will ask for an input file (the txt file that contains your text) and an output file (the name of the file that will hold the word counts; give it a .txt extension).
Name this file wordfreq.py.
from re import compile
# read the input file, lowercase it, and pull out the words
l = compile("([\w,.'\x92]*\w)").findall(open(raw_input('Input file: '), 'r').read().lower())
f = open(raw_input('Output file: '), 'w')
# write each unique word and its count, tab-separated
for word in set(l):
    print >> f, word, '\t', l.count(word)
f.close()
$ python wordfreq.py
You can then open the file in a spreadsheet program and sort the frequencies. Once you get past the common words like a, an, and the, you will start to see the meaningful words used in the text.
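If you would rather leave the common words out in the script itself instead of scrolling past them in the spreadsheet, a small variation on wordfreq.py like the one below skips them before counting. The stop word list here is just a starter set; extend it with whatever words clutter your results.
from re import compile
# a starter list of common words to leave out; add your own
stopwords = set(['a', 'an', 'and', 'for', 'in', 'is', 'it', 'of', 'on', 'or', 'that', 'the', 'this', 'to', 'with'])
l = compile("([\w,.'\x92]*\w)").findall(open(raw_input('Input file: '), 'r').read().lower())
f = open(raw_input('Output file: '), 'w')
for word in set(l):
    if word not in stopwords:
        f.write(word + '\t' + str(l.count(word)) + '\n')
f.close()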
Now you have a basic understanding of API concepts. Practice using these techniques and do some research on how to use other APIs.