Last Update: 20.12.2006. By kerim in python
Now, for those who are interested in HOW I did it: it’s actually pretty simple.
First of all, pick a site that you want to download content from. I looked around, but since (as said earlier) I didn’t find many “normal” ones, I chose youporn.com.
Two reasons:
1) Plenty of content
2) Rather simple way of getting at it.
Here is how to do it:
1) We start at the base URL www.youporn.com and simply read the HTML:
# all snippets below assume these imports at the top of the module:
import re
import urllib

def fetchpage(self, url):
    pagefile = urllib.urlopen(url)  # enter a proxy here if wished
    return pagefile.read()
2) Once we have that HTML, we can determine how many pages exist on the site that we might check:
def findLastPage(self, html):
    linkPattern = r"page=[0-9]+"
    p = re.compile(linkPattern)
    m = p.findall(html)
    m.pop()              # the very last match belongs to the "next" link
    lastPage = m.pop()   # the second-to-last match holds the highest page number
    lastPage = lastPage[lastPage.rfind('=') + 1:]  # get rid of the 'page='
    return lastPage
As you can see, it’s the second-to-last link that reads “page=” followed by a number; the very last one belongs to the “next” link.
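To make that concrete, here is what the matching does on a made-up snippet of pagination HTML (this is not the site’s real markup, just an illustration):

import re

html = '<a href="?page=2">2</a> <a href="?page=117">117</a> <a href="?page=2">next</a>'
m = re.findall(r"page=[0-9]+", html)
print m        # ['page=2', 'page=117', 'page=2']
m.pop()        # drop the trailing 'next' link
print m.pop()  # 'page=117' -- the highest page number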
3) OK, so we have the last page; the first is naturally … 1 :-)
Now all we have to do is fetch every page and try to find all the file links:
firstPage = 1
lastPage = int(lastPage)
for page in range(firstPage, lastPage + 1):  # range() excludes the upper bound
    print "fetching: " + baseURL + '/?page=' + str(page)
    html = self.fetchpage(baseURL + '/?page=' + str(page))
    self.searchlinks(baseURL, html)
print "pages to check: " + str(len(self.allPages))
But what does “searchlinks” really do?
Well, basically, all files on the site are located at pages that follow the scheme www.youporn.com/watch/<aNumberHere>.
So we must collect all those pages, again using regular expressions:
def searchlinks(self, baseURL, html):
    linkPattern = r"/watch/[0-9]+"
    p = re.compile(linkPattern)
    m = p.findall(html)
    for page in set(m):  # set() removes duplicate links on the same page
        page = baseURL + page
        self.allPages.append(page)
4) OK … but this list is actually only another list of pages. Yes, the site doesn’t directly store files at …/watch/<number>; each of those is yet another page with one (FLV) or two (FLV and WMV) links to the REAL files.
So we need another loop that goes through all of the “/watch/<number>” pages:
for page in set(self.allPages):
    print "looking for files in: " + page
    html = self.fetchpage(page)
    self.searchfiles(html)
print "Files to download: " + str(len(self.allFiles))
5) For each of those pages we call searchfiles, which uses a regular expression to find the file links (and honors a preference for the file type):
def searchfiles(self, html):
    flvPattern = r"http://download\.youporn\.com/download/[0-9]+/flv/[\w_]+\.flv"
    wmvPattern = r"http://download\.youporn\.com/download/[0-9]+/pc/[\w_]+\.wmv"
    p = re.compile(flvPattern)
    p2 = re.compile(wmvPattern)
    m = None
    if self.preference == "wmv":
        m = p2.search(html)
    if m is None:  # either FLV is preferred or no WMV link was found
        m = p.search(html)
    if m is None:
        return
    url = m.group()
    filename = url[url.rfind("/") + 1:]  # everything after the last slash
    self.allFiles[url] = filename
The URL and the filename are stored in the allFiles dictionary.
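Just to illustrate with a made-up entry, allFiles ends up mapping each file URL to its local filename:

allFiles = {
    "http://download.youporn.com/download/12345/flv/some_clip.flv": "some_clip.flv",
}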
6) All that’s left to do now is download the files:
print "Using %d number of threads to download" %self.numThreads
mtfetcher.download(self.allFiles,self.numThreads)
And “mtfetcher”? What is that? That’s a multithreaded downloader. I modified a version from Marcio Boufleur called ppd.py! It can be found at www-usr-inf.ufsm.br/~boufleur/ppd
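In case that link ever goes dead: the idea is simply a pool of worker threads pulling (URL, filename) pairs off a shared queue. Here is a minimal sketch of such a downloader (my own illustration, not Boufleur’s actual ppd.py):

import threading
import urllib
from Queue import Queue, Empty

def download(files, numThreads):
    # files: dictionary mapping URL -> local filename, as built by searchfiles
    queue = Queue()
    for url, filename in files.items():
        queue.put((url, filename))

    def worker():
        while True:
            try:
                url, filename = queue.get_nowait()
            except Empty:  # queue drained, this worker is done
                return
            print "downloading: " + url
            urllib.urlretrieve(url, filename)

    threads = [threading.Thread(target=worker) for i in range(numThreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()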
Other sites can be downloaded by extending the youpornloader: override the start method and the other methods that fetch pages and search for links or files, etc.
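For example, a new loader could look something like this (the class name YouPornLoader, the placeholder site, and the example pattern are all assumptions for illustration, not the actual source):

class ExampleSiteLoader(YouPornLoader):  # 'YouPornLoader' assumed as the class holding the methods above
    def start(self):
        baseURL = "http://www.example.com"  # placeholder for the new site
        html = self.fetchpage(baseURL)
        last = int(self.findLastPage(html))
        for page in range(1, last + 1):
            self.searchlinks(baseURL, self.fetchpage(baseURL + '/?page=' + str(page)))
        for page in set(self.allPages):
            self.searchfiles(self.fetchpage(page))

    def searchfiles(self, html):
        # adapt the pattern to whatever the new site's direct file links look like
        m = re.search(r"http://www\.example\.com/files/[0-9]+/[\w_]+\.flv", html)
        if m is not None:
            url = m.group()
            self.allFiles[url] = url[url.rfind("/") + 1:]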
PornoPython is more or less only a “loader” for all the other classes.