Text-processing
Description
Objective:
The objective of this assignment is for you to practice using the common text-processing tools available on Unix systems.
To do this, you will process a large input data set and extract a few bits of information. Each question / exercise can be answered with a simple pipeline of common Unix commands.
This assignment is worth 20 points.
Summary:
Every System Administrator sooner or later finds herself in the position of having to correlate events from e.g. an http log. In this exercise, we will use the web logs provided by the Wikimedia foundation, allowing us to process a sufficiently large and diverse data set.
The Wikimedia Foundation makes available logs of their web servers at https://dumps.wikimedia.org/. The data and format we’re interested in is described in more detail on this page. In a nutshell, the files contain lines of data containing a small number of fields:
domain page_title count_views total_response_size
We will be looking at the data from March 27th, 2016: https://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-03/pagecounts-20160327-000000.gz
Details:
Create a NetBSD EC2 instance of ami-569ed93c. All your work is to be done on this instance. Your solutions will be tested on such an instance. Do NOT work on your own system; the tools may behave differently and I may not be able to verify your solution.
Answer the following questions. Describe your method, then provide the exact commands you used to determine the answer, then show the answer. For example, if the question was “How many lines are in the file?”, then your answer might be:
Uncompress the input file and pipe output into wc(1):
$ gzcat pagecounts-20160327-000000.gz | wc -l
9771932
Answer all of these questions:
How many unique objects were requested?
How many unique objects were requested for en only?
Which is the most often requested object for de?
How many requests per second were handled during this hour?
How much data was transferred in total?
Which was the largest object requested for fr?
What is the longest word found on the ten most frequently retrieved English Wikipedia pages?
Content
HW5_CS615
- Ankai Liang
- 10411998
Create an instance of ami-569ed93c and log in
➜ ~ git:(master) ✗ aws ec2 run-instances --image-id ami-569ed93c --instance-type t1.micro --key-name keypair3 --security-groups mysg
➜ ~ git:(master) ✗ aws ec2 describe-instances --filters "Name=image-id,Values=ami-569ed93c"
Obtain the public DNS: ec2-34-207-94-168.compute-1.amazonaws.com
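The public DNS name can also be extracted directly with the CLI's JMESPath query option (an optional refinement; the PublicDnsName field is taken from the standard describe-instances output):
➜ ~ git:(master) ✗ aws ec2 describe-instances --filters "Name=image-id,Values=ami-569ed93c" --query "Reservations[].Instances[].PublicDnsName" --output text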
➜ ~ git:(master) ✗ cd documents
Use scp to copy the data set onto the instance:
➜ ~ git:(master) ✗ scp -i ~/Documents/p.pem ~/Desktop/pagecounts-20160327-000000.gz root@ec2-34-207-94-168.compute-1.amazonaws.com:~/pagecounts-20160327-000000.gz
Wait until the transfer finishes.
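An alternative, not used here, would be to download the dump directly on the instance; NetBSD's ftp(1) accepts HTTPS URLs (the same mechanism used later to fetch the curl tarball), which avoids the local copy:
ip-172-31-53-207# ftp https://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-03/pagecounts-20160327-000000.gz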
Answer the Questions
First, take a glimpse at the data set to see what it looks like:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | head -n 3
- How many unique objects were requested?
At first, I found that the t1.micro's memory was insufficient; a t1.micro has only 0.613 GB of RAM.
I created another c3.large instance of ami-569ed93c. The creation process is the same as above, so it is not repeated here. The new instance's public DNS is ec2-34-205-154-26.compute-1.amazonaws.com.
I also tried gzcat pagecounts-20160327-000000.gz | cut -f2 -d' ' | sort | uniq | wc -l, but memory was still insufficient.
By reading the documentation at http://www.gnu.org/software/gawk/manual/gawk.html, I figured out how to answer this question.
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk 'BEGIN{count=0} {if (data[$2]++ == 0){count++}} END{print "There are", count," unique objects."}'
After finishing Question 3, I switched to '[ ]' as the delimiter:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' 'BEGIN{count=0} {if (data[$2]++ == 0){count++}} END{print "There are", count," unique objects."}'
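For reference, the same associative-array logic written out as a commented awk program (a sketch equivalent to the one-liner above; count needs no BEGIN block because awk initializes it to 0):
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' '
    # data[] remembers every page_title seen so far; the post-increment
    # is 0 only the first time a given title appears.
    { if (data[$2]++ == 0) count++ }
    END { print "There are", count, "unique objects." }
'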
After reading the professor's reply on the mailing list, I realized that a unique object is a "unique pairing of domain and resource name", so the answer is simply the number of lines:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | wc -l
- How many unique objects were requested for en only?
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk 'BEGIN{count=0} {if ($1 == "en" && data[$2]++ == 0){count++}} END{print "There are", count," unique objects for en."}'
After finishing Question 3, I switched to '[ ]' as the delimiter:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' 'BEGIN{count=0} {if ($1 == "en" && data[$2]++ == 0){count++}} END{print "There are", count," unique objects for en."}'
I also found a clearer solution:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | grep "^en[ ]" | wc -l
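A quick sanity check on two made-up lines (hypothetical input) shows that the trailing '[ ]' keeps related domains such as 'en.m' from matching:
$ printf 'en Foo 1 10\nen.m Foo 1 10\n' | grep "^en[ ]"
en Foo 1 10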
- Which is the most often requested object for de?
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk 'BEGIN{maxCount=0; ob=""} {if ($1 == "de" && $3>maxCount){maxCount = $3; ob = $2}} END{print "The most often requested object for de is", ob, " maxcount is ", maxCount}'
This result looked abnormal, so I checked the record:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk '{if ($1=="de" && ($2 =="575")) print $0}'
It seems this record lacks the 'page_title' field (or 'page_title' is empty), so awk treated 'count_views' as 'page_title' and 'total_response_size' as 'count_views'. This is not an isolated occurrence:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk '{if ($4 == "") print $0}'
I wanted to set the delimiter to exactly one space. I tried -F ' ', but that failed: a single-space field separator is awk's default and still splits on runs of whitespace. Once I realized the separator is interpreted as a regular expression, I knew what to do:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' 'BEGIN{maxCount=0; ob=""} {if ($1 == "de" && $3>maxCount){maxCount = $3; ob = $2}} END{print "The most often requested object for de is", ob, " maxcount is ", maxCount}'
The most often requested object for de is Wikipedia:Hauptseite.
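To see concretely what '[ ]' changes, here is a minimal illustration on a hypothetical record whose page_title is empty (two adjacent spaces): the default separator collapses the doubled space and shifts the fields, while '[ ]' keeps the empty second field.
$ printf 'de  575 6827\n' | awk '{print NF}'
3
$ printf 'de  575 6827\n' | awk -F '[ ]' '{print NF}'
4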
I also found a clearer solution (since the object in each row is already unique):
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | grep "^de[ ]" | sort -t' ' -k3,3nr | head -1 | cut -f1,2 -d' '
- How many requests per second were handled during this hour?
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' '{count+=$3} END{print count/3600}'
9999.72 requests per second.
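As a cross-check, the same sum can be printed alongside the rate (9999.72 requests per second over 3600 seconds is roughly 36 million requests in the hour):
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' '{count+=$3} END{printf "%.0f requests, %.2f requests per second\n", count, count/3600}'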
- How much data was transferred in total?
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' '{count+=$4} END{print count}'
The total is 800635232210.
I looked at the details of this log format at https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites; the size field is in bytes.
So 800635232210 bytes of data were transferred in total.
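For readability, the same sum can also be printed in gigabytes (800635232210 / 10^9 is roughly 800.6 GB):
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' '{sum+=$4} END{printf "%.0f bytes (%.1f GB)\n", sum, sum/1e9}'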
- Which was the largest object requested for fr?
I assumed that the size of an object equals 'total_response_size' divided by 'count_views'.
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' 'BEGIN{max=0; ob=""} {if ($1 == "fr" && ($4/$3)>max){max = ($4/$3); ob = $2}} END{print "The largest object requested for fr is", ob, " the size is ", max}'
A clearer version:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | grep "^fr[ ]" | awk -F '[ ]' '{print $0,$4/$3}' | sort -t' ' -k5,5nr | head -1 | cut -f1,2 -d' '
The largest object requested for fr is Projet:Palette/Maintenance/Listes
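A defensive variant of the same pipeline (my own addition: it skips the malformed rows noted in Question 3, where the shifted fields leave $3 empty and the division would be undefined):
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | awk -F '[ ]' 'BEGIN{max=0} {if ($1 == "fr" && $3 > 0 && ($4/$3) > max){max = $4/$3; ob = $2}} END{print "The largest object requested for fr is", ob, "the size is", max}'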
- What is the longest word found on the ten most frequently retrieved English Wikipedia pages?
To solve this problem, we first need the ten most frequently retrieved 'page_title' values. English Wikipedia pages means the domain is 'en' or 'en.*'.
At first, I found that some of these pages could not be crawled. Then I found the explanation in the "Aggregation for .mw" section at https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw#Aggregation_for_.mw: the '.mw' rows are aggregated counts, not individual pages.
So I used grep to select the 'en' domains while excluding the 'en.mw' domain, then sorted by 'count_views' and took the first 10 lines:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | grep "^en" | grep -v "^en.mw" | sort -t' ' -k3,3nr | head -10 | cut -f1,2 -d' '
Then I needed to generate the URL of each page, using awk:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | grep "^en" | grep -v "^en.mw" | sort -t' ' -k3,3nr | head -10 | cut -f1,2 -d' ' | awk '{printf ("https://%s.wikipedia.org/wiki/%s\n",$1,$2)}'
I chose curl to fetch the web pages. Before that, I needed to install curl; I found the source at https://curl.haxx.se/download.html.
ip-172-31-53-207# ftp https://curl.haxx.se/download/curl-7.53.1.tar.gz
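After downloading, the usual source build applies (a sketch of the standard autoconf steps; I assume configure finds the base-system OpenSSL so that HTTPS support is built in, and the exact prerequisites may differ on a stock NetBSD instance):
ip-172-31-53-207# tar xzf curl-7.53.1.tar.gz
ip-172-31-53-207# cd curl-7.53.1
ip-172-31-53-207# ./configure && make && make install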
Fetch one page as a test:
ip-172-31-53-207# curl https://en.wikipedia.org/wiki/Simon_Pegg
curl complained that it could not verify the SSL certificate; I chose -k to skip the verification, and -s to suppress the progress output in the later processing.
ip-172-31-53-207# curl -k -s https://en.wikipedia.org/wiki/Simon_Pegg | head -6
Then I needed to clean up the useless characters. I used 'tr' to replace every non-alphabetic character with a space, leaving only letters:
ip-172-31-53-207# curl -k -s https://en.wikipedia.org/wiki/Simon_Pegg | head -6 | tr -c "[:alpha:]" " "
Use 'awk' to calculate the answer:
...| awk 'BEGIN {maxWord=0; maxLeng=0;} {for(i=1;i<=NF;i++) if(length($i)>maxLeng){maxWord=$i; maxLeng=length($i);}} END {print maxWord,maxLeng}'
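Written out with comments, the same scanner reads (logic unchanged, only the layout differs):
...| awk '
    BEGIN { maxWord = 0; maxLeng = 0 }
    {
        # examine every whitespace-separated token on the line
        for (i = 1; i <= NF; i++)
            if (length($i) > maxLeng) { maxWord = $i; maxLeng = length($i) }
    }
    END { print maxWord, maxLeng }
'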
Finally, I used 'xargs' to pass the URLs to 'curl' one by one, and combined all of the above:
ip-172-31-53-207# gzcat pagecounts-20160327-000000.gz | grep "^en" | grep -v "^en.mw" | sort -t' ' -k3,3nr | head -10 | cut -f1,2 -d' ' | awk '{printf ("https://%s.wikipedia.org/wiki/%s\n",$1,$2)}' | xargs curl -k -s | tr -c "[:alpha:]" " " | awk 'BEGIN {maxWord=0; maxLeng=0;} {for(i=1;i<=NF;i++) if(length($i)>maxLeng){maxWord=$i; maxLeng=length($i);}} END {print maxWord,maxLeng}'
So the longest word found on the ten most frequently retrieved English Wikipedia pages is wgRelatedArticlesOnlyUseCirrusSearch.