
Custom InputFormat

 

 

 

0 Introduction:

 

The various implementations of InputFormat are each defined for a specific kind of data source:

for example, FileInputFormat for file-based input and DBInputFormat for databases.

But what if your data source looks like neither of those?

That is when you need to define a custom InputFormat.
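To put the custom implementation in context: a custom InputFormat only has to answer two questions, how to split the data source and how to turn one split into a stream of <k1,v1> records. Below is the shape of the new-API org.apache.hadoop.mapreduce.InputFormat base class, trimmed to its two abstract methods; the comments are added here for explanation and are not part of the Hadoop source.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// The contract every custom InputFormat fulfills (trimmed sketch of the real base class).
public abstract class InputFormat<K, V> {

	// Logically divide the data source; each InputSplit later drives one map task.
	public abstract List<InputSplit> getSplits(JobContext context)
			throws IOException, InterruptedException;

	// Create the RecordReader that parses one split into <k1,v1> pairs for the Mapper.
	public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
			throws IOException, InterruptedException;
}

The example later in this post implements exactly these two methods (plus its own InputSplit and RecordReader classes).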

 


One flow to keep clear (a sketch follows below):

one split ----> corresponds to one map task;

one split ----> after being processed by the RecordReader, produces many <k1,v1> pairs, and each pair triggers one call to the Mapper's map() method.
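That second arrow is easiest to see in the driver loop the new-API Mapper already runs for each split. The snippet below is a simplified sketch of Mapper.run(), shown as an override so it compiles on its own; the placeholder key/value types are mine, not from this post.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of what Mapper.run() does for every split: one map task = one split = one pass through this loop.
public class RunLoopSketch extends Mapper<LongWritable, Text, Text, LongWritable> {
	@Override
	public void run(Context context) throws IOException, InterruptedException {
		setup(context);
		// context.nextKeyValue() delegates to the RecordReader created for this split;
		// every <k1,v1> pair it yields triggers exactly one call to map().
		while (context.nextKeyValue()) {
			map(context.getCurrentKey(), context.getCurrentValue(), context);
		}
		cleanup(context);
	}
}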

 

 

 

 

1 Modeled on FileInputFormat, FileSplit, and LineRecordReader, define a custom InputFormat

that fetches its values from memory and runs a MapReduce computation over them:

 

package inputformat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * A custom InputFormat that parses a particular kind of data source. In the example below the
 * data source lives in memory (not in HDFS or a DB).
 * Because the values are generated randomly, the map and reduce phases do almost no aggregation
 * (each key usually appears only once).
 *
 * The custom getSplits() below produces three splits; each split's data source is an array of
 * length 10, and each split corresponds to one map task.
 * After passing through the custom RecordReader, each split's array yields 10 <k,v> pairs,
 * and those 10 pairs invoke Mapper.map() one after another.
 * You can confirm this flow by running the example and watching
 * System.out.println("MyMapper map value is " + line);
 *
 * Compare FileInputFormat: there one split typically corresponds to one HDFS block
 * (64 MB by default in Hadoop 1.x). In the word-count analogy, if a 64 MB "hello" file
 * contains 6,400 lines, LineRecordReader produces 6,400 <k,v> pairs, each of the form
 * <byte offset of the line, text of the line>, and those 6,400 pairs call Mapper.map() 6,400 times.
 *
 * Sample output:
 * MyMapper map value is Text53348
14/12/03 13:17:49 INFO mapred.MapTask: Starting flush of map output
MyMapper map value is Text320473
MyMapper map value is Text320021
MyMapper map value is Text768245
MyMapper map value is Text642733
MyMapper map value is Text48407
MyMapper map value is Text789931
MyMapper map value is Text651215
MyMapper map value is Text552616
MyMapper map value is Text968669
14/12/03 13:17:50 INFO mapred.MapTask: Finished spill 0
14/12/03 13:17:50 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/12/03 13:17:50 INFO mapred.LocalJobRunner: 
14/12/03 13:17:50 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
14/12/03 13:17:50 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
14/12/03 13:17:50 INFO mapred.MapTask: io.sort.mb = 100
14/12/03 13:17:50 INFO mapred.MapTask: data buffer = 79691776/99614720
14/12/03 13:17:50 INFO mapred.MapTask: record buffer = 262144/327680
MyMapper map value is Text150861
MyMapper map value is Text428272
MyMapper map value is Text695122
MyMapper map value is Text401944
MyMapper map value is Text405576
MyMapper map value is Text651821
MyMapper map value is Text497050
MyMapper map value is Text447011
MyMapper map value is Text918767
MyMapper map value is Text567241
14/12/03 13:17:50 INFO mapred.MapTask: Starting flush of map output
14/12/03 13:17:50 INFO mapred.MapTask: Finished spill 0
14/12/03 13:17:50 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
14/12/03 13:17:50 INFO mapred.LocalJobRunner: 
14/12/03 13:17:50 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
14/12/03 13:17:50 INFO mapred.Task:  Using ResourceCalculatorPlugin : null
14/12/03 13:17:50 INFO mapred.MapTask: io.sort.mb = 100
14/12/03 13:17:50 INFO mapred.MapTask: data buffer = 79691776/99614720
14/12/03 13:17:50 INFO mapred.MapTask: record buffer = 262144/327680
MyMapper map value is Text403321
MyMapper map value is Text600217
MyMapper map value is Text387387
MyMapper map value is Text63766
MyMapper map value is Text240805
MyMapper map value is Text247539
MyMapper map value is Text901107
MyMapper map value is Text766337
MyMapper map value is Text523199
MyMapper map value is Text780722





Text205845      1
Text213736      1
Text23425       1
Text267465      1
Text287345      1
Text287679      1
Text297910      1
Text311346      1
Text341091      1
Text418038      1
Text491331      1
Text523116      1
Text528894      1
Text621959      1
Text641714      1
Text64916       1
Text660309      1
Text699375      1
Text713395      1
Text754231      1
Text788194      1
Text812630      1
Text817771      1
Text862128      1
Text870210      1
Text916419      1
Text919783      1
Text932819      1
Text93461       1
Text974656      1

 */
public class MyselInputFormatApp {
	private static final String OUT_PATH = "hdfs://master:9000/out";

	public static void main(String[] args) throws Exception{
		Configuration conf = new Configuration();
		final FileSystem filesystem = FileSystem.get(new URI(OUT_PATH), conf);
		if(filesystem.exists(new Path(OUT_PATH))){
			filesystem.delete(new Path(OUT_PATH), true);
		}
		
		final Job job = new Job(conf , MyselInputFormatApp.class.getSimpleName());
		job.setJarByClass(MyselInputFormatApp.class);
		
		job.setInputFormatClass(MyselfMemoryInputFormat.class);
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);
		
		job.setReducerClass(MyReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);
		FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
		
		job.waitForCompletion(true);
	}
	
	public static class MyMapper extends Mapper<NullWritable, Text, Text, LongWritable>{
		//
		protected void map(NullWritable key, Text value, org.apache.hadoop.mapreduce.Mapper<NullWritable,Text,Text,LongWritable>.Context context) throws java.io.IOException ,InterruptedException {
			
			/**
			 * With the custom InputFormat below, suppose a custom split carries the in-memory data 1,2,3,...,10.
			 * After the custom split and custom RecordReader, this split arrives at map() as:
			 * <null,1> <null,2> ... <null,10>
			 * and after map() runs, the final output is: <1,1> <2,1> ... <10,1>
			 */
			final String line = value.toString();
			System.out.println("MyMapper map value is " + line);
			final String[] splited = line.split("\t");
			
			
			for (String word : splited) {
				context.write(new Text(word), new LongWritable(1));
			}
		};
	}
	
	
	
	
	// The process of distributing the <k,v> pairs produced by map to reduce is called the shuffle
	public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable>{
		protected void reduce(Text key, java.lang.Iterable<LongWritable> values, org.apache.hadoop.mapreduce.Reducer<Text,LongWritable,Text,LongWritable>.Context context) throws java.io.IOException ,InterruptedException {
			// count is the number of times the word `key` appears in the whole input
			long count = 0L;
			for (LongWritable times : values) {
				count += times.get();
			}
			context.write(key, new LongWritable(count));
		};
	}
	
	/**
	 * Generates data in memory and parses it into individual key/value pairs.
	 * This custom MyselfMemoryInputFormat fixes the resulting <key1,value1> type as <NullWritable,Text>.
	 */
	public static class MyselfMemoryInputFormat extends InputFormat<NullWritable, Text>{

		@Override
		public List<InputSplit> getSplits(JobContext context)
				throws IOException, InterruptedException {
			final ArrayList<InputSplit> result = new ArrayList<InputSplit>();
			// Modeled on FileInputFormat.getSplits, which does: splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
			// Here we hand-construct three splits, which yields three map tasks.
			result.add(new MemoryInputSplit());
			result.add(new MemoryInputSplit());
			result.add(new MemoryInputSplit());
			
			return result;
		}

		// Specifies the class that parses a split into <k1,v1> pairs
		@Override
		public RecordReader<NullWritable, Text> createRecordReader(
				InputSplit split, TaskAttemptContext context)
				throws IOException, InterruptedException {
			return new MemoryRecordReader();
		}

	}
	// Modeled on FileInputFormat.getSplits: splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
	// There the data source is read from HDFS; here we generate random numbers to simulate a data source held in memory.
	// The split must implement Writable, because splits are serialized and shipped between nodes.
	public static class MemoryInputSplit extends InputSplit implements Writable{
		final int SIZE = 10;
		final ArrayWritable arrayWritable = new ArrayWritable(Text.class);
		
		/**
		 * Build a plain Java array first, then wrap it in Hadoop's ArrayWritable. A split can carry a lot of data;
		 * in this custom split we use 10 randomly generated values to simulate the data the split would read.
		 */
		public MemoryInputSplit() {
			Text[] array = new Text[SIZE];
			
			final Random random = new Random();
			for (int i = 0; i < SIZE; i++) {
				final int nextInt = random.nextInt(999999);
				final Text text = new Text("Text"+nextInt);
				array[i] = text;
			}
			
			arrayWritable.set(array);
		}
		
		@Override
		public long getLength() throws IOException, InterruptedException {
			return SIZE;
		}

		@Override
		public String[] getLocations() throws IOException, InterruptedException {
			return new String[] {"localhost"};
		}

		public ArrayWritable getValues() {
			return arrayWritable;
		}

		// The two methods required by Writable: they write out and read back the data that has to be shipped.
		/**
		 * Compare FileSplit's write/read methods: FileSplit writes out the file path plus the start offset and
		 * length, i.e. a pointer to where the target data lives. In our custom split the data itself is in
		 * memory, so we simply write it out with arrayWritable.write(out);
		 *
		 * FileSplit.write, for reference:
		 *   public void write(DataOutput out) throws IOException {
		 *     Text.writeString(out, file.toString());
		 *     out.writeLong(start);
		 *     out.writeLong(length);
		 *   }
		 */
		@Override
		public void write(DataOutput out) throws IOException {
			arrayWritable.write(out);
		}

		@Override
		public void readFields(DataInput in) throws IOException {
			arrayWritable.readFields(in);
		}
		
		
	}
	
	public static class MemoryRecordReader extends RecordReader<NullWritable, Text>{
		Writable[] values = null;
		Text value = null;
		int i = 0;
		// After this custom RecordReader is initialized it needs the data carried by the split it was handed,
		// which is why the custom split exposes its own getValues() method.
		@Override
		public void initialize(InputSplit split, TaskAttemptContext context)
				throws IOException, InterruptedException {
			MemoryInputSplit inputSplit = (MemoryInputSplit)split;
			ArrayWritable writables = inputSplit.getValues();
			this.values = writables.get();
			this.i = 0;
		}

		@Override
		public boolean nextKeyValue() throws IOException, InterruptedException {
			if(i>=values.length) {// if values --> [1,2,...,10], this RecordReader produces the pairs <null,1> <null,2> ... <null,10>
				return false;
			}
			if(this.value==null) {
				this.value = new Text();
			}
			this.value.set((Text)values[i]);
			i++;
			return true;
		}

		@Override
		public NullWritable getCurrentKey() throws IOException,
				InterruptedException {
			return NullWritable.get();
		}

		@Override
		public Text getCurrentValue() throws IOException, InterruptedException {
			return value;
		}

		@Override
		public float getProgress() throws IOException, InterruptedException {
			// Progress reporting is left unimplemented in this example; always report 0.
			return 0;
		}

		@Override
		public void close() throws IOException {
			
		}
		
	}
}
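One detail worth calling out: MemoryInputSplit implements Writable because the framework serializes splits before handing them to map tasks. The small test class below is a hypothetical sketch (not part of the original example; the class name is mine) that round-trips a MemoryInputSplit through write()/readFields() to show the in-memory values survive serialization. Put it in the same inputformat package so the nested class is visible.

package inputformat;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.Writable;

// Hypothetical helper: serialize a MemoryInputSplit and read it back,
// mimicking what the framework does when it ships a split to a task.
public class MemorySplitRoundTrip {
	public static void main(String[] args) throws Exception {
		MyselInputFormatApp.MemoryInputSplit original = new MyselInputFormatApp.MemoryInputSplit();

		// write(): the split's ArrayWritable payload goes out as bytes.
		ByteArrayOutputStream bytes = new ByteArrayOutputStream();
		original.write(new DataOutputStream(bytes));

		// readFields(): a fresh split (its constructor fills random values,
		// which readFields then overwrites) is rebuilt from those bytes.
		MyselInputFormatApp.MemoryInputSplit copy = new MyselInputFormatApp.MemoryInputSplit();
		copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

		// The deserialized values should equal the originals.
		Writable[] a = original.getValues().get();
		Writable[] b = copy.getValues().get();
		for (int i = 0; i < a.length; i++) {
			System.out.println(a[i] + " <-> " + b[i] + " : " + a[i].equals(b[i]));
		}
	}
}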

 

 
