Analysis of Twitter posts with the stream processing systems Apache Spark and Apache Storm

N.A. Gorshkov, V.S. Denisiov

Abstract


The article compares the stream processing systems Apache Storm and Apache Spark on the task of analyzing posts from the social network Twitter. It first describes the basic concepts of the two engines, their configuration, and how applications are launched on them. It then presents the specific tweet analysis problems considered and the structure of the cluster on which the performance test was carried out. Finally, it draws conclusions on the applicability of Storm and Spark to the problems considered.

Full Text:

PDF (Russian)

References


Hesla, “Particle physics tames big data” http://www.symmetrymagazine.org/article/august-2012/particle-physics-tames-big-data

Hirak Kashyap, Hasin Afzal Ahmed, “Big Data Analytics in Bioinformatics: A Machine Learning Perspective” http://arxiv.org/pdf/1506.05101.pdf

Eric D. Feigelson and G. Jogesh Babu, “Big data in astronomy” http://astrostatistics.psu.edu/2012Significance.pdf

Saeed Shahrivari and Saeed Jalili, “Beyond Batch Processing: Towards Real-Time and Streaming Big Data” https://arxiv.org/ftp/arxiv/papers/1403/1403.3375.pdf

Zeba Khanam and Shafali Agarwal, “Map-Reduce Implementations: Survey and Performance Comparison” http://airccse.org/journal/jcsit/7415ijcsit10.pdf

Apache Hadoop http://hadoop.apache.org/

Andrew C. Oliver, “Storm or Spark: Choose your real-time weapon” http://www.infoworld.com/article/2854894/application-development/spark-and-storm-for-real-time-computation.html

Apache Spark documentation http://spark.apache.org/docs/latest/

Apache Storm documentation http://storm.apache.org/releases/current/index.html

Apache Kafka documentation http://kafka.apache.org/documentation.html

Twitter Streaming API https://dev.twitter.com/streaming/overview

Apache Flume https://flume.apache.org/

Amazon Kinesis Streams https://aws.amazon.com/ru/kinesis/streams/

Apache Zookeeper https://zookeeper.apache.org/

AWS EC2 documentation https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

Apache Hadoop YARN https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html

Apache Mesos http://mesos.apache.org/

Matei Zaharia, Tathagata Das, et al., “Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing” https://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf

Sanket Chintapalli, Derek Dagit, Bobby Evans, et al., “Benchmarking Streaming Computation Engines at Yahoo!” https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Apache Flink https://flink.apache.org/

Source code of the Yahoo! performance benchmark https://github.com/yahoo/streaming-benchmarks

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, et al., “Class-Based n-gram Models of Natural Language” http://www.aclweb.org/anthology/J92-4003

Alberto Barrón-Cedeño and Paolo Rosso, “On Automatic Plagiarism Detection Based on n-Grams Comparison” http://users.dsic.upv.es/~prosso/resources/BarronRosso_ECIR09.pdf

William B. Cavnar and John M. Trenkle, “N-Gram-Based Text Categorization” http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf

David Sundby, “Spelling correction using N-grams” http://fileadmin.cs.lth.se/cs/education/EDA171/Reports/2009/david.pdf

Hosebird Client https://github.com/twitter/hbc

Twitter Apps https://apps.twitter.com/

Source code of the producer program that sends tweets to Kafka https://github.com/GorshkovNikita/kafka-test

Source code of the programs for the Spark and Storm engines https://github.com/GorshkovNikita/streaming-engines-comparison

Jonathan Leibiusky, Gabriel Eisbruch and Dario Simonassi, “Getting Started with Storm”

Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia, “Learning Spark”





ISSN: 2307-8162