Scraping a table from any web page with R or CloudStat
When you need data from the internet, you don't have to retype it: if you know the page's URL, you can extract (scrape) the table directly.
This is possible thanks to the XML package for R, which provides the very handy readHTMLTable() function.
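Before pointing readHTMLTable() at a live page, it can help to try it on a small piece of HTML that you control. The sketch below is only an illustration: the HTML string and its numbers are made up, and it assumes the XML package is installed. It parses the string with htmlParse() and then reads the first table into a data frame.

```r
library(XML)

# A tiny, made-up HTML page held in a string (illustrative data only).
html <- "<html><body>
<table>
<tr><th>Airline</th><th>Score</th></tr>
<tr><td>Southwest</td><td>79</td></tr>
<tr><td>Delta</td><td>62</td></tr>
</table>
</body></html>"

# asText = TRUE tells htmlParse() that the input is HTML content,
# not a file name or a URL.
doc <- htmlParse(html, asText = TRUE)

# which = 1 returns the first table as a data frame rather than a list.
demo.table <- readHTMLTable(doc, header = TRUE, which = 1,
                            stringsAsFactors = FALSE)
print(demo.table)
```

The same call works with a URL in place of the parsed document, which is what the two examples in this post do.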
As a case study, I want to scrape two tables:
US Airline Customer Score.
World Top Chess Players (Men).
A. Scraping US Airline Customer Score table from
http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines
Code:
library(XML)  # provides readHTMLTable()
airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
# which=1 selects the first table on the page
airline.table = readHTMLTable(airline, header=TRUE, which=1, stringsAsFactors=FALSE)
Result:
> library(XML)
Warning message:
package "XML" was built under R version 2.14.1
> airline = "http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines"
> airline.table = readHTMLTable(airline, header=T, which=1,stringsAsFactors=F)
> airline.table
Base-line 95 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10
1 Southwest 78 76 76 76 74 72 70 70 74 75 73 74 74 76 79 81 79
2 All Others NM 70 74 70 62 67 63 64 72 74 73 74 74 75 75 77 75
3 Airlines 72 69 69 67 65 63 63 61 66 67 66 66 65 63 62 64 66
4 Continental 67 64 66 64 66 64 62 67 68 68 67 70 67 69 62 68 71
5 American 70 71 71 62 67 64 63 62 63 67 66 64 62 60 62 60 63
6 United 71 67 70 68 65 62 62 59 64 63 64 61 63 56 56 56 60
7 US Airways 72 67 66 68 65 61 62 60 63 64 62 57 62 61 54 59 62
8 Delta 77 72 67 69 65 68 66 61 66 67 67 65 64 59 60 64 62
9 Northwest Airlines 69 71 67 64 63 53 62 56 65 64 64 64 61 61 57 57 61
11 PreviousYear%Change FirstYear%Change
1 81 2.5 3.8
3 65 -1.5 -9.7
4 64 -9.9 -4.5
5 63 0.0 -10.0
7 61 -1.6 -15.3
8 56 -9.7 -27.3
9 # N/A N/A
>
B. Scraping World Top Chess players (Men) table from http://ratings.fide.com/top.phtml?list=men
Code:
chess = "http://ratings.fide.com/top.phtml?list=men"
# which=5 selects the fifth table on the page (library(XML) is already loaded)
chess.table = readHTMLTable(chess, header=TRUE, which=5, stringsAsFactors=FALSE)
Result:
> chess = "http://ratings.fide.com/top.phtml?list=men"
> chess.table = readHTMLTable(chess, header=T, which=5,stringsAsFactors=F)
> chess.table
Rank Name Title Country Rating Games B-Year
1 1 Carlsen, Magnus g NOR 2835 17 1990
2 2 Aronian, Levon g ARM 2805 25 1982
3 3 Kramnik, Vladimir g RUS 2801 17 1975
4 4 Anand, Viswanathan g IND 2799 17 1969
5 5 Radjabov, Teimour g AZE 2773 9 1987
6 6 Topalov, Veselin g BUL 2770 9 1975
7 7 Karjakin, Sergey g RUS 2769 16 1990
8 8 Ivanchuk, Vassily g UKR 2766 16 1969
9 9 Morozevich, Alexander g RUS 2763 6 1977
10 10 Gashimov, Vugar g AZE 2761 9 1986
11 11 Grischuk, Alexander g RUS 2761 8 1983
12 12 Nakamura, Hikaru g USA 2759 17 1987
13 13 Svidler, Peter g RUS 2749 17 1976
14 14 Mamedyarov, Shakhriyar g AZE 2747 9 1985
15 15 Tomashevsky, Evgeny g RUS 2740 0 1987
16 16 Gelfand, Boris g ISR 2739 9 1968
17 17 Caruana, Fabiano g ITA 2736 19 1992
18 18 Nepomniachtchi, Ian g RUS 2735 16 1990
19 19 Wang, Hao g CHN 2733 6 1989
20 20 Kamsky, Gata g USA 2732 0 1974
21 21 Dominguez Perez, Leinier g CUB 2730 6 1983
22 22 Jakovenko, Dmitry g RUS 2729 0 1983
23 23 Ponomariov, Ruslan g UKR 2727 13 1983
24 24 Vitiugov, Nikita g RUS 2726 1 1987
25 25 Adams, Michael g ENG 2724 17 1971
26 26 Leko, Peter g HUN 2720 9 1979
27 27 Almasi, Zoltan g HUN 2717 8 1976
28 28 Giri, Anish g NED 2714 15 1994
29 29 Le, Quang Liem g VIE 2714 0 1991
30 30 Navara, David g CZE 2712 8 1985
31 31 Shirov, Alexei g LAT 2710 13 1972
32 32 Polgar, Judit g HUN 2710 0 1976
33 33 Riazantsev, Alexander g RUS 2710 0 1985
34 34 Wojtaszek, Radoslaw g POL 2706 8 1987
35 35 Moiseenko, Alexander g UKR 2706 7 1980
36 36 Vallejo Pons, Francisco g ESP 2705 15 1982
37 37 Malakhov, Vladimir g RUS 2705 0 1980
38 38 Jobava, Baadur g GEO 2704 23 1983
39 39 Bacrot, Etienne g FRA 2704 14 1983
40 40 Laznicka, Viktor g CZE 2704 8 1988
41 41 Sutovsky, Emil g ISR 2703 8 1977
42 42 Naiditsch, Arkadij g GER 2702 14 1985
43 43 Movsesian, Sergei g ARM 2700 9 1978
44 44 Sasikiran, Krishnan g IND 2700 9 1981
45 45 Vachier-Lagrave, Maxime g FRA 2699 13 1990
46 46 Dreev, Aleksey g RUS 2698 6 1969
47 47 Efimenko, Zahar g UKR 2695 8 1985
48 48 Volokitin, Andrei g UKR 2695 0 1986
49 49 Wang, Yue g CHN 2694 6 1987
50 50 Fressinet, Laurent g FRA 2693 17 1981
51 51 Li, Chao b g CHN 2693 6 1989
52 52 Grachev, Boris g RUS 2693 0 1986
53 53 Nielsen, Peter Heine g DEN 2693 0 1973
54 54 Van Wely, Loek g NED 2692 13 1972
55 55 Bruzon Batista, Lazaro g CUB 2691 19 1982
56 56 McShane, Luke J g ENG 2691 8 1984
57 57 Eljanov, Pavel g UKR 2690 10 1983
58 58 Kasimdzhanov, Rustam g UZB 2689 14 1979
59 59 Inarkiev, Ernesto g RUS 2689 6 1985
60 60 Zvjaginsev, Vadim g RUS 2688 8 1976
61 61 Andreikin, Dmitry g RUS 2688 0 1990
62 62 Areshchenko, Alexander g UKR 2688 0 1986
63 63 Rublevsky, Sergei g RUS 2686 0 1974
64 64 Akopian, Vladimir g ARM 2685 8 1971
65 65 Potkin, Vladimir g RUS 2684 0 1982
66 66 Sargissian, Gabriel g ARM 2683 15 1983
67 67 Berkes, Ferenc g HUN 2682 16 1985
68 68 Bologan, Viktor g MDA 2680 15 1971
69 69 Bauer, Christian g FRA 2679 24 1977
70 70 Tiviakov, Sergei g NED 2677 22 1973
71 71 Short, Nigel D g ENG 2677 15 1965
72 72 Motylev, Alexander g RUS 2677 6 1979
73 73 Gharamian, Tigran g FRA 2676 0 1984
74 74 Kobalia, Mikhail g RUS 2673 0 1978
75 75 Meier, Georg g GER 2671 9 1987
76 76 Onischuk, Alexander g USA 2670 13 1975
77 77 Bu, Xiangzhi g CHN 2670 6 1985
78 78 Alekseev, Evgeny g RUS 2670 0 1985
79 79 Azarov, Sergei g BLR 2667 0 1983
80 80 Kryvoruchko, Yuriy g UKR 2666 0 1986
81 81 Balogh, Csaba g HUN 2665 8 1987
82 82 Harikrishna, P. g IND 2665 6 1986
83 83 Khismatullin, Denis g RUS 2664 8 1984
84 84 Nguyen, Ngoc Truong Son g VIE 2662 6 1990
85 85 Fridman, Daniel g GER 2660 11 1976
86 86 Smirin, Ilia g ISR 2660 7 1968
87 87 Ding, Liren g CHN 2660 6 1992
88 88 Sadler, Matthew D g ENG 2660 3 1974
89 89 Korobov, Anton g UKR 2660 0 1985
90 90 Cheparinov, Ivan g BUL 2659 18 1986
91 91 Timofeev, Artyom g RUS 2659 0 1985
92 92 Georgiev, Kiril g BUL 2658 17 1965
93 93 Bartel, Mateusz g POL 2658 9 1985
94 94 Zhigalko, Sergei g BLR 2658 8 1989
95 95 Feller, Sebastien g FRA 2658 0 1991
96 96 Ragger, Markus g AUT 2655 17 1988
97 97 Jones, Gawain C B g ENG 2653 27 1987
98 98 So, Wesley g PHI 2653 5 1993
99 99 Milov, Vadim g SUI 2653 0 1972
100 100 Gupta, Abhijeet g IND 2652 9 1989
101 101 Postny, Evgeny g ISR 2652 8 1981
102 102 Roiz, Michael g ISR 2652 6 1983
103 103 Gyimesi, Zoltan g HUN 2652 4 1977
104 104 Nikolic, Predrag g BIH 2652 2 1960
>
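The value passed to `which` (1 for the airline page, 5 for the chess page) depends on how each page is laid out, so it usually has to be found by inspection. One possible way, sketched here on a made-up HTML string rather than the real FIDE page, is to read every table at once and look at how many rows each one has. The names html, all.tables, n.rows and target are only illustrative.

```r
library(XML)

# Made-up page: a small layout table followed by the data table we want.
html <- "<html><body>
<table>
<tr><td>Home</td><td>About</td></tr>
<tr><td>News</td><td>Contact</td></tr>
</table>
<table>
<tr><th>Rank</th><th>Name</th><th>Rating</th></tr>
<tr><td>1</td><td>Carlsen, Magnus</td><td>2835</td></tr>
<tr><td>2</td><td>Aronian, Levon</td><td>2805</td></tr>
<tr><td>3</td><td>Kramnik, Vladimir</td><td>2801</td></tr>
</table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# Without `which`, readHTMLTable() returns a list with one entry per <table>.
all.tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Count rows per table; a table that could not be read counts as 0 rows.
n.rows <- vapply(all.tables,
                 function(tbl) if (is.null(tbl)) 0L else nrow(tbl),
                 integer(1))
print(n.rows)

# Here the data table is simply the one with the most rows.
target <- unname(which.max(n.rows))
chess.demo <- all.tables[[target]]
```

Picking the largest table is a heuristic, not a rule. On a real page it is worth printing the first few rows of each candidate before settling on an index.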
Done. You have successfully scraped data from a web page with R or CloudStat.
Now you can analyze it as usual. Great! No more retyping the data. Enjoy!
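One thing to watch before analyzing: with stringsAsFactors=F and no column classes given, the scraped columns generally come back as character strings, so numbers need converting first. The sketch below uses base R only and rebuilds three rows from the chess output by hand (the data frame chess.demo is a stand-in, not the scraped object itself).

```r
# Three rows copied from the chess output, stored as character columns
# the way a scraped table typically arrives.
chess.demo <- data.frame(
  Name    = c("Carlsen, Magnus", "Aronian, Levon", "Kramnik, Vladimir"),
  Country = c("NOR", "ARM", "RUS"),
  Rating  = c("2835", "2805", "2801"),
  stringsAsFactors = FALSE
)

# Convert the rating column from character to numeric.
chess.demo$Rating <- as.numeric(chess.demo$Rating)

# Numeric summaries now work as usual.
mean.rating <- mean(chess.demo$Rating)
print(mean.rating)
```

Entries such as "N/A" in the airline table are not numbers, so as.numeric() turns them into NA with a warning. That is usually the behaviour you want, but it is worth checking.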
Source: http://www.r-bloggers.com/scraping-table-from-any-web-page-with-r-or-cloudstat/