I want to get author affiliation information for papers on arXiv. arXiv provides an API for bulk-querying its database. Using it, I look for the 'arxiv:affiliation' element in the returned XML (Atom) data. Here's the code:
import urllib
from BeautifulSoup import BeautifulStoneSoup
url = 'http://export.arxiv.org/api/query?search_query=all:astro&start=0&max_results=1000'
data = urllib.urlopen(url).read()
soup = BeautifulStoneSoup(data)
#list = soup.findAll('arxiv:affiliation')
#for i in range(len(list)):
#    print list[i].contents
test = [tag.string for tag in soup.findAll('arxiv:affiliation')]
The problem is that this gives me one flat list with the affiliations of all authors across all papers. I want to split it into one set of affiliations per paper, and that's where I'm stuck. Once I have that, I can move on to the next part of this pet project: displaying the relations between universities based on their authors.
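One possible way to do the grouping, as a minimal sketch rather than a definitive answer: walk the feed one paper at a time and collect the affiliations found inside each paper. This sketch swaps BeautifulSoup for Python 3's standard `xml.etree.ElementTree`, and it assumes that each paper is an Atom `<entry>`, that affiliations sit inside `<author>` elements, and that the `arxiv:` prefix maps to the namespace `http://arxiv.org/schemas/atom`. The function name `affiliations_by_paper` and the `SAMPLE_FEED` data are made up for illustration and are not real arXiv output.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions about the feed; check them against the
# xmlns declarations at the top of the XML you actually receive.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hypothetical hand-written feed, used only to show the shape of the data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <title>Paper A</title>
    <author>
      <name>A. One</name>
      <arxiv:affiliation>University X</arxiv:affiliation>
    </author>
    <author>
      <name>B. Two</name>
      <arxiv:affiliation>Institute Y</arxiv:affiliation>
    </author>
  </entry>
  <entry>
    <title>Paper B</title>
    <author><name>C. Three</name></author>
  </entry>
</feed>"""


def affiliations_by_paper(feed_xml):
    """Return one list of affiliation strings per <entry> in the feed."""
    root = ET.fromstring(feed_xml)
    papers = []
    for entry in root.findall("atom:entry", NS):
        # Search only inside this entry, so affiliations stay grouped by paper.
        affiliations = [
            aff.text.strip()
            for aff in entry.findall("atom:author/arxiv:affiliation", NS)
            if aff.text
        ]
        papers.append(affiliations)
    return papers


if __name__ == "__main__":
    for number, affiliations in enumerate(affiliations_by_paper(SAMPLE_FEED), 1):
        print(number, affiliations)
```

Papers whose authors list no affiliation come back as empty lists, so the result has one item per paper and the positions still line up with the entries in the feed. To run this on live data, the string passed in would be the response body fetched from the query URL above.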