Wednesday, 02 March 2011

Converting a scanned PDF to HTML text


While looking for reference material, I came across a PDF file whose content was stored as images rather than text, so I could not copy the text out of it (a bad habit, I know). One way to turn a scanned PDF into HTML is to use Google: when a PDF file appears in the search results, Google offers an option to view it as HTML. This only helps if Google has indexed the whole PDF, and even then Google shows only the first 20 pages.

Another approach is to convert the PDF into a set of images (.png, .jpg), run OCR on those images, and write the result as HTML or plain text. Because I use Ubuntu GNU/Linux, the steps below apply only to GNU/Linux (for Windows, you will have to ask someone else).

Let's get straight to it.
First, install xpdf, imagemagick, and ocropus:


sudo apt-get install xpdf imagemagick ocropus
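
Before running the script it can help to confirm that the three programs it calls are actually on PATH. The following is only a minimal sketch; `missing_tools` is a hypothetical helper name chosen for this example and is not part of any of the packages above.

```shell
#!/bin/sh
# Print the name of every command from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not provided by any package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing when all three programs are installed.
missing_tools pdfimages convert ocroscript
```

If a name is printed, the matching package probably needs to be installed or the program lives outside PATH.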

Use the script below to do the conversion. Save it as "pdf2txt" (without the quotes).


#! /bin/sh
# Simple wrapper to recognize PDF files
#
# This script is based on ocropdf by Christian Mahnke
# original script can be downloaded from http://groups.google.com/group/ocropus/attach/e3cd3c9c36dfce87/ocropdf?part=2
#
# I only modified it to meet my own requirements
#
#Usage:
#pdf2txt input.pdf > hocr-output.html
#
#The following environment variables are recognised:
#- PDFIMAGES: Path to 'pdfimages' if it's not in your path
#- CONVERT: Path to 'convert' if it's not in your path
#- OCROSCRIPT: Path to 'ocroscript' if it's not in your path or this script is not
#  placed in the ocropus source tree (in the 'ocrocmd' directory)
#- tesslanguage: The language tesseract should use.
#
#
#Known problems:
# - Doesn't work with file names containing spaces.
# - Only works with a single PDF file.
#
#Possible improvements
# - reimplement it as Lua script.
# - Use this approach (imagemagick) to be able to recognise TIFF and other file formats.
#
# By Alvin from orangunix.blogspot.com
if test -z "$PDFIMAGES" ; then
PDFIMAGES=`which pdfimages`
fi
if test -z "$CONVERT" ; then
CONVERT=`which convert`
fi
if test -z "$PDFIMAGES" ; then
echo "'pdfimages' not found in PATH (it's part of the xpdf package)" >&2
fi
if test -z "$CONVERT" ; then
echo "'convert' not found in PATH (it's part of the imagemagick package)" >&2
fi
if test -z "$OCROSCRIPT" ; then
OCROSCRIPT=`which ocroscript`
if test -z "$OCROSCRIPT" ; then
DIR=`dirname $0`/../ocroscript
OCROSCRIPT="$DIR/ocroscript"
if test -z "$OCROSCRIPTS" ; then
OCROSCRIPTS=$DIR/scripts
fi
fi
fi
if test -z "$1" ; then
echo "Usage: ./pdf2txt input.pdf > hocr-output.html"
exit 1
fi
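# Create a private temporary directory in one step
TMP_DIR=`mktemp -d -t pdf2txt.XXXXXX`
# Diagnostics go to stderr so they do not end up in the redirected HTML output
echo $TMP_DIR >&2
PDFIMAGES_CMD="$PDFIMAGES $1 $TMP_DIR/pdf2txt"
echo $PDFIMAGES_CMD >&2
$PDFIMAGES_CMD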
for FILE in `ls $TMP_DIR`
do
#echo $FILE
CONVERT_CMD="$CONVERT $TMP_DIR/$FILE $TMP_DIR/$FILE.jpg"
$CONVERT_CMD
if test $? != 0 ; then
echo "'convert' failed"
exit 2
fi
FILES="$FILES $TMP_DIR/$FILE.jpg"
#rm -f $TMP_DIR/$FILE
done
# Pass full paths; bare names from 'ls' would not be found outside $TMP_DIR
$OCROSCRIPT recognize $TMP_DIR/*.pbm.jpg
#if test -z "$tesslanguage" ; then
# OCROSCIPT_CMD="$OCROSCRIPT rec-tess $FILES"
#else
# OCROSCIPT_CMD="$OCROSCRIPT rec-tess --tesslanguage=$tesslanguage $FILES"
#fi
#
#$OCROSCIPT_CMD
#rm -r $TMP_DIR
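
The script header lists file names containing spaces as a known problem. One cause is the loop over the output of `ls`, which the shell splits on whitespace. The sketch below shows one possible way to write that loop with a glob and quoted variables; `convert_all` is a hypothetical function name, and the example assumes ImageMagick's `convert` unless the CONVERT variable says otherwise.

```shell
#!/bin/sh
# Convert every file in a directory to JPEG, tolerating spaces in names.
# CONVERT defaults to ImageMagick's `convert`, as in the script above.
CONVERT=${CONVERT:-convert}

# convert_all is a hypothetical helper written for this example.
convert_all() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue             # skip if the glob matched nothing
        case $f in *.jpg) continue ;; esac  # do not reconvert earlier output
        "$CONVERT" "$f" "$f.jpg" || return 2
    done
}
```

Calling `convert_all "$TMP_DIR"` would then stand in for the `for FILE in ...` loop. The other unquoted variables in the script would need the same treatment before spaces are fully supported.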


Usage:
For example, to convert "makalah.pdf" to an HTML file, make the script executable and run it:

chmod +x pdf2txt
./pdf2txt makalah.pdf > makalah.html

The second command converts "makalah.pdf" and writes the output to "makalah.html".
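
The script handles only one PDF per run. A small wrapper can call it once per file and derive each output name from the input name. This is a sketch under assumptions: `html_name` and `convert_each` are hypothetical names, and PDF2TXT is assumed to point at the script saved above.

```shell
#!/bin/sh
# Derive the output name: strip a trailing .pdf and append .html.
html_name() {
    printf '%s\n' "${1%.pdf}.html"
}

# Assumed location of the script saved earlier; override if it lives elsewhere.
PDF2TXT=${PDF2TXT:-./pdf2txt}

# Run the converter once for each PDF given as an argument.
convert_each() {
    for pdf in "$@"; do
        "$PDF2TXT" "$pdf" > "$(html_name "$pdf")" || return 1
    done
}
```

For example, `convert_each makalah.pdf laporan.pdf` would write makalah.html and laporan.html. File names with spaces still fail inside pdf2txt itself.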

by: Alvin


Sunday, 27 February 2011

Bandwidth Management with Squid


Squid is free and open-source proxy/cache server software, usually run on GNU/Linux. One of its strengths is that it can shape and limit bandwidth by file extension: for example, a server can be configured so that clients get only 100 kbps when downloading files ending in .flv, .avi, and so on.
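
Squid's delay pools express rates in bytes per second, so a limit quoted in kbps has to be converted first. The sketch below assumes "kbps" means kilobits per second with 1 kbit = 1000 bits; `kbps_to_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# Convert kilobits per second to bytes per second.
# Assumption: 1 kbit = 1000 bits, 1 byte = 8 bits.
kbps_to_bytes() {
    echo $(( $1 * 1000 / 8 ))
}

kbps_to_bytes 100    # prints 12500
```

Under that assumption, a 100 kbps limit corresponds to a restore rate of 12500 bytes per second.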

Here are the installation steps:
  • I recommend installing Squid from source so that it can be configured at compile time, because by default Squid is built without the delay pools feature used for bandwidth control.

  • For best performance, create a dedicated partition of about 300 MB and mount it at "/cache" (without the quotes).

  • Add a "squid" user to the system:
useradd -d /cache/ -r -s /dev/null squid >/dev/null 2>&1
  • Download the Squid source from the link below (at the time of writing, the latest Squid release is version 3.1.11):
http://www.squid-cache.org/Versions/v3/3.1/squid-3.1.11.tar.gz
  • Extract the .tar.gz file:
tar xzpf squid-3.1.11.tar.gz
  • Compile and install Squid from inside the extracted squid-3.1.11 directory; the installation directory is /opt/squid:
./configure --prefix=/opt/squid --exec-prefix=/opt/squid --enable-delay-pools --enable-cache-digests --enable-poll --disable-ident-lookups --enable-truncate --enable-removal-policies

make all

make install
  • Edit squid.conf (located at /opt/squid/etc/squid.conf):
#
# Recommended minimum configuration:
#
acl manager proto cache_object
acl localhost src 127.0.0.1/32 ::1
acl to_localhost dst 127.0.0.0/8 0.0.0.0/32 ::1

# Example rule allowing access from your local networks.
# Adapt to list your (internal) IP networks from where browsing
# should be allowed
acl localnet src 10.0.0.0/8 # RFC1918 possible internal network
#acl localnet src 172.16.0.0/12 # RFC1918 possible internal network
acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
#acl localnet src fc00::/7 # RFC 4193 local private network range
#acl localnet src fe80::/10 # RFC 4291 link-local (directly plugged) machines

acl SSL_ports port 443
acl Safe_ports port 80 # http
acl Safe_ports port 21 # ftp
acl Safe_ports port 443 # https
#acl Safe_ports port 70 # gopher
#acl Safe_ports port 210 # wais
acl Safe_ports port 1025-65535 # unregistered ports
#acl Safe_ports port 280 # http-mgmt
#acl Safe_ports port 488 # gss-http
#acl Safe_ports port 591 # filemaker
#acl Safe_ports port 777 # multiling http
acl CONNECT method CONNECT

#
# Recommended minimum Access Permission configuration:
#
# Only allow cachemgr access from localhost
http_access allow manager localhost
http_access deny manager

# Deny requests to certain unsafe ports
http_access deny !Safe_ports

# Deny CONNECT to other than secure SSL ports
http_access deny CONNECT !SSL_ports

# We strongly recommend the following be uncommented to protect innocent
# web applications running on the proxy server who think the only
# one who can access services on "localhost" is a local user
http_access deny to_localhost

#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#

# squid as an httpd accelerator
#httpd_accel_host virtual
# port you want to act as a proxy
#httpd_accel_port 80
# Squid act as both a local httpd accelerator and as a proxy
#httpd_accel_with_proxy on
# Header is turned on which is the hostname from the URL
#httpd_accel_uses_host_header on

# Example rule allowing access from your local networks.
# Adapt localnet in the ACL section to list your (internal) IP networks
# from where browsing should be allowed
http_access allow localnet
http_access allow localhost

# And finally deny all other access to this proxy
http_access deny all

# Squid normally listens to port 3128
http_port 3128

# We recommend you to use at least the following line.
hierarchy_stoplist cgi-bin ?

# Set the user and group that Squid runs as. Note: Squid must be started as root first!
cache_effective_user squid
cache_effective_group squid
#Memory the Squid will use. Well, Squid will use far more than that.
cache_mem 16 MB

# Uncomment and adjust the following to add a disk cache directory.
cache_dir ufs /cache 250 16 256

# Leave coredumps in the first cache dir
coredump_dir /opt/squid/var/cache

#Places where Squid's logs will go to.
cache_log /var/log/squid/cache.log
access_log /var/log/squid/access.log
cache_store_log /var/log/squid/store.log
cache_swap_log /var/log/squid/swap.log
#How many times to rotate the logs before deleting them.
#See the FAQ for more info.
logfile_rotate 10

# Add any of your own refresh_pattern entries above these.
refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
refresh_pattern . 0 20% 4320

#all our LAN users will be seen by external web servers
#as if they all used Mozilla on windows xp sp2. :)
#anonymize_headers deny User-Agent
#fake_user_agent Mozilla/5.0 (compatible; U;windows xp sp2; en-US)

### DELAY POOLS ###

#This is the most important part for shaping incoming traffic with Squid
#For detailed description see squid.conf file or docs at http://www.squid-cache.org

#We want to limit bandwidth for downloads of the file types listed below.
#Write them all on one line. The dots are escaped because url_regex takes
#regular expressions, where a bare "." matches any character.
acl magic_words url_regex -i ftp \.exe \.mp3 \.vqf \.tar\.gz \.gz \.rpm \.deb \.zip \.rar \.avi \.mpeg \.mpe \.bin \.sh \.tar\.bz2 \.pdf \.mkv \.ogg \.mpg \.qt \.ram \.rm \.iso \.raw \.wav \.mov \.wmv \.flv \.mp4
#We do not limit .html, .gif, .jpg and similar files
#because they do not use much bandwidth

#We define a single delay pool.
#See the Squid documentation to get familiar
#with delay_pools and delay_class.
delay_pools 1

#There are three pool classes; here we deal only with the second,
#which has one aggregate bucket plus one bucket per client IP.
#Pool 1 is of class 2.
delay_class 1 2

#A value of -1/-1 would mean that there is no limit.
#The numbers are restore rate (bytes/s) / bucket size (bytes);
#we must remember that Squid doesn't consider start/stop bits
#20000/150000 are values for the whole network
#20000/120000 are values for a single IP
#after a client's download exceeds about 120000 bytes,
#(or even twice or three times as much)
#it will continue at about 20000 bytes/s

delay_parameters 1 20000/150000 20000/120000
delay_access 1 allow magic_words


  • Set up the Squid directories:
mkdir /var/log/squid/
chown squid:squid /var/log/squid/
chmod 770 /var/log/squid/
chown -R squid:squid /opt/squid/
chown -R squid:squid /cache/

  • Because this is the first time Squid runs, have it create the cache directories first (if your build installed the binary under /opt/squid/sbin rather than /opt/squid/bin, adjust the path):
/opt/squid/bin/squid -z
  • Start Squid:
/opt/squid/bin/squid
Squid will listen on port 3128.
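
To get a feel for what the line `delay_parameters 1 20000/150000 20000/120000` in squid.conf means for one client, the sketch below estimates how long a download spends at the throttled rate. It is a rough model only: it assumes the client's bucket starts full, ignores refill during the initial burst, and ignores the real link speed. `est_seconds` is a hypothetical helper name.

```shell
#!/bin/sh
# est_seconds SIZE RESTORE MAX
# Rough number of seconds spent at the throttled rate for a file of SIZE bytes,
# given a restore rate (bytes/s) and a bucket size MAX (bytes).
est_seconds() {
    size=$1 restore=$2 max=$3
    if [ "$size" -le "$max" ]; then
        echo 0                                  # fits in the initial burst
    else
        echo $(( (size - max) / restore ))      # remainder at the restore rate
    fi
}

est_seconds 10000000 20000 120000    # prints 494
```

With the per-IP values from the configuration, a file of about 10 MB would spend roughly eight minutes at the limited rate under this simplified model.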
