Moviegrabbing with PPython: How Did You Do That?

Last Update: 20.12.2006. By kerim in python

Now for those that are interested in HOW I did it. It’s actually pretty simple.

First of all, pick a site that you want to download content from. I looked around, but since, as I said earlier, I didn't find many “normal” ones, I have chosen youporn.com.
Two reasons:
1) Plenty of content
2) Rather simple way of getting at it.

Here is how to do it:

1) We start at the base URL www.youtube.com and simply read the HTML:

    import urllib

    def fetchpage(self, url):
        # Python 2: urllib.urlopen returns a file-like object for the URL
        pagefile = urllib.urlopen(url)  # enter proxy here if wished
        return pagefile.read()
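
The “enter proxy here” comment can be made concrete. The following is a minimal sketch in Python 3 (the snippets in this post are Python 2), not the loader's actual code. The names `make_opener` and `fetch_page`, the timeout value and the proxy address used in the usage note are all assumptions made for illustration.

```python
import urllib.request

def make_opener(proxy_url=None):
    """Build a URL opener, optionally routing HTTP traffic through a proxy."""
    handlers = []
    if proxy_url:
        # ProxyHandler takes a mapping of scheme -> proxy URL
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url}))
    return urllib.request.build_opener(*handlers)

def fetch_page(url, opener=None, timeout=30):
    """Return the page body as text (undecodable bytes are replaced)."""
    opener = opener or make_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a hypothetical proxy, a call would look like `fetch_page(url, make_opener("http://proxy.example:8080"))`; without one, `fetch_page(url)` is enough.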

2) Once we have that HTML, we can determine how many pages exist on youtube that we might check:

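    import re

    def findLastPage(self, html):
        linkPattern = r"page=[0-9]+"  # "page=" followed by at least one digit
        p = re.compile(linkPattern)
        m = p.findall(html)
        m.pop()  # last item = the "next" link
        lastPage = m.pop()  # second-to-last item = link to the last page
        lastPage = lastPage[lastPage.rfind('=')+1:]  # get rid of the 'page='
        return lastPage  # still a string; convert with int() before using it in range()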

As you can see, it's the second-to-last link that contains “page=” followed by a number.
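
To see the “second-to-last link” rule at work, here is a small Python 3 sketch run against made-up pagination markup. The sample HTML, the function name `find_last_page` and the fallback to a single page are assumptions; the real site's markup may differ.

```python
import re

def find_last_page(html):
    """Return the last page number linked from a pagination bar, as an int."""
    matches = re.findall(r"page=([0-9]+)", html)
    if len(matches) < 2:
        return 1  # no pagination found: treat it as a single page
    # The final match is the "next" link, so the one before it is the last page.
    return int(matches[-2])

# Invented pagination bar: pages 2, 3 and 57, then a "next" link back to page 2
sample = (
    '<a href="/?page=2">2</a> <a href="/?page=3">3</a> '
    '<a href="/?page=57">57</a> <a href="/?page=2">next</a>'
)
print(find_last_page(sample))
```

Returning an int here means the result can go straight into `range()`.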

3) OK, so we have the last page; the first is naturally … 1 :-)
Now all we have to do is fetch all pages and try to determine all file links:

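    # min and max are the first and last page numbers (ints);
    # note that range() stops one short of max
    for page in range(min, max):
        print "fetching: "+baseURL+'/?page='+str(page)
        html = self.fetchpage(baseURL+'/?page='+str(page))
        self.searchlinks(baseURL, html)
    print "pages to check: "+str(len(self.allPages))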

But what does “searchlinks” really do?
Well, basically on youtube all files are located on pages that follow the scheme www.youtube.com/watch/<aNumberHere>.
So we must collect the links to all those pages, using the following code, again with regular expressions:

    def searchlinks(self, baseURL, html):
        linkPattern = r"/watch/[0-9]+"  # "/watch/" followed by at least one digit
        p = re.compile(linkPattern)
        m = p.findall(html)
        for page in set(m):  # set() drops duplicate links
            page = baseURL+page
            self.allPages.append(page)
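
The same link extraction can be tried offline on a snippet of invented HTML. This Python 3 sketch is not the loader's code: `search_links`, the sample markup and the `www.example.com` base URL are assumptions, and it uses `dict.fromkeys` instead of `set()` so that the links keep their first-seen order.

```python
import re

def search_links(base_url, html):
    """Return the unique /watch/<number> links in html as absolute URLs."""
    paths = re.findall(r"/watch/[0-9]+", html)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return [base_url + path for path in dict.fromkeys(paths)]

# Invented markup with one duplicated link
sample = (
    '<a href="/watch/101">a</a> <a href="/watch/202">b</a> '
    '<a href="/watch/101">a again</a>'
)
links = search_links("http://www.example.com", sample)
print(links)
```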

4) OK … but we see that this list is actually only another list of pages. Yes, youtube doesn't store the files directly at …/watch/number; each of those is another page with one (flv file) or two (flv and wmv file) links to the REAL files.
So we need another loop that goes through all the “/watch/number” pages:

    for page in set(self.allPages):
        print "looking for files in: " + page
        html = self.fetchpage(page)
        self.searchfiles(html)
    print "Files to download: "+str(self.allFiles)

5) For each of those pages we call searchfiles, which uses regular expressions to find the files (and honors a preference for the file type):

    def searchfiles(self, html):
        # dots are escaped so that they only match a literal "."
        flvPattern = r"http://download\.youporn\.com/download/[0-9]*/flv/\w*\.flv"
        wmvPattern = r"http://download\.youporn\.com/download/[0-9]*/pc/\w*\.wmv"
        p = re.compile(flvPattern)
        p2 = re.compile(wmvPattern)
        m = None
        if self.preference == "wmv":
            m = p2.search(html)
        if m is None or self.preference != "wmv":  # fall back to flv
            m = p.search(html)
            if m is None:
                return  # no file link on this page
        url = m.group()
        filename = url[url.rfind("/")+1:]  # everything after the last "/"
        self.allFiles[url] = filename

URL and filename are stored in a dictionary.
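
The preference logic and the URL-to-filename dictionary can be checked on invented markup. This is a Python 3 sketch under stated assumptions: the host `download.example.com`, the sample HTML and the name `search_file` are hypothetical; only the path layout follows the patterns above.

```python
import re

# Hypothetical host; only the path layout follows the patterns in the post.
FLV = re.compile(r"http://download\.example\.com/download/[0-9]+/flv/\w+\.flv")
WMV = re.compile(r"http://download\.example\.com/download/[0-9]+/pc/\w+\.wmv")

def search_file(html, preference="flv"):
    """Return (url, filename) for the preferred type, falling back to flv."""
    match = WMV.search(html) if preference == "wmv" else None
    if match is None:
        match = FLV.search(html)
    if match is None:
        return None  # no file link on this page
    url = match.group()
    return url, url.rsplit("/", 1)[-1]  # filename = text after the last "/"

# Invented page with one flv link and one wmv link
sample = (
    '<a href="http://download.example.com/download/42/flv/clip_42.flv">flv</a> '
    '<a href="http://download.example.com/download/42/pc/clip_42.wmv">wmv</a>'
)
all_files = {}
found = search_file(sample, preference="wmv")
if found:
    all_files[found[0]] = found[1]  # URL -> filename, as in the post
print(all_files)
```
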
6) All that is left to do now is to download the files:

    print "Using %d number of threads to download" %self.numThreads  
    mtfetcher.download(self.allFiles,self.numThreads)

And “mtfetcher”? What is that? That's a multithreaded downloader. I modified a version of ppd.py by Marcio Boufleur! It can be found at www-usr-inf.ufsm.br/~boufleur/ppd
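
For readers who just want the idea of a multithreaded downloader, here is a rough Python 3 sketch. It is not ppd.py and not the modified mtfetcher; it swaps in the standard library's `ThreadPoolExecutor`, and everything apart from the `download(all_files, num_threads)` call shape is an assumption, including the `fetch` parameter.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_to_disk(url, filename):
    """Download one URL to a local file (the default fetch function)."""
    urllib.request.urlretrieve(url, filename)
    return filename

def download(all_files, num_threads, fetch=fetch_to_disk):
    """Fetch every url -> filename pair using num_threads worker threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = {pool.submit(fetch, url, name): url
                   for url, name in all_files.items()}
        for future, url in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                results[url] = exc
    return results
```

The `fetch` parameter is there so the threading can be tried with a stand-in function before pointing it at real URLs.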

Other sites can be downloaded by extending the youpornloader and overriding the start method and the other methods that fetch pages, search for files or links, etc.
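
What such an extension could look like is sketched below in Python 3. The real youpornloader class is not shown in this post, so `BaseLoader`, `OtherSiteLoader`, the `link_pattern` attribute and the `/video/<number>` scheme are all hypothetical stand-ins, and the page fetch is replaced by canned HTML so the sketch runs offline.

```python
import re

class BaseLoader:
    """Hypothetical base class with the same steps as the loader described above."""
    link_pattern = r"/watch/[0-9]+"

    def __init__(self, base_url):
        self.base_url = base_url
        self.all_pages = []

    def fetchpage(self, url):
        raise NotImplementedError  # network access is left out of this sketch

    def searchlinks(self, html):
        for path in sorted(set(re.findall(self.link_pattern, html))):
            self.all_pages.append(self.base_url + path)

    def start(self):
        self.searchlinks(self.fetchpage(self.base_url))
        return self.all_pages

class OtherSiteLoader(BaseLoader):
    """Loader for a made-up site that uses /video/<number> links."""
    link_pattern = r"/video/[0-9]+"

    def fetchpage(self, url):
        # Canned HTML instead of a real request, so the sketch runs offline
        return '<a href="/video/7">x</a> <a href="/video/3">y</a>'

pages = OtherSiteLoader("http://other.example").start()
print(pages)
```

Only the pieces that differ per site (the link pattern and how a page is fetched) are overridden; the rest is inherited.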

PornoPython is more or less only a “loader” for all the other classes.