Linux kernel学习-进程基本

Sunday June 3rd, 2012 by Uranus Zhou

本文同步自（如浏览不正常请点击跳转）：https://zohead.com/archives/linux-kernel-learning-process/

Linux 中进程通过 fork() 被创建时，它差不多是和父进程一样的，它得到父进程的地址空间拷贝，运行和父进程一样的代码，从 fork() 的后面开始执行，父进程和子进程共享代码页，但子进程的 data 页是独立的（包括 stack 和 heap）。

早期的 Linux kernel 并不支持多线程的程序，从 kernel 来看，一个多线程的程序只是一个普通的进程，它的多个执行流应该完全在 user mode 来完成创建、处理、调度等操作，例如使用 POSIX pthread 库。当然这样的实现是无法让人满意的，Linux 为此使用轻量级进程为多线程程序提供更好的支持，两个轻量级进程可以共享资源（例如：地址空间、打开的文件等等），一个比较简单的方法是将为每个线程关联一个轻量级进程，这样每个线程可以被 kernel 单独调度，使用 Linux 轻量级进程的库有：LinuxThreads、NPTL、NGPT 等。Linux kernel 同时也支持线程组（可以理解为轻量级进程组）的概念。

1、进程描述符：

进程描述符由 task_struct 结构来表示，一般来说，每个可以被独立调度的执行上下文都必须有自己的进程描述符，因此尽管轻量级进程共享了很大一部分 kernel 数据结构，它也必须有自己的 task_struct。task_struct 中包含关于一个进程的差不多所有信息，它定义在 include/linux/sched.h 文件中，你会看到这是非常大的结构，其中还包含指向其它结构的指针。访问进程自身的 task_struct 结构，使用宏操作 current。

task_struct 中的 struct mm_struct *mm 即指向进程的地址空间。task_struct 的 state 字段表示进程的运行状态，取值有 TASK_RUNNING（正在运行或正在队列中等待运行，进程如果在用户空间只能为此状态）、TASK_INTERRUPTIBLE（可响应信号）、TASK_UNINTERRUPTIBLE（不响应信号）、TASK_STOPED 等，另外 state 还有特殊的两个值是 EXIT_ZOMBIE（僵尸进程）和 EXIT_DEAD（进程将被系统移除）。kernel 提供 set_task_state 宏修改进程状态，set_task_state 最终调用 set_mb，set_current_state 用于当前进程的状态。task_struct 的 pid 字段就是咱们喜闻乐见的进程 ID 了。

这是一个典型的 Linux 进程状态机图：

POSIX 1003.1c 标准规定一个多线程程序的每个线程都应该有相同的 PID，这样的好处是例如发一个信号给一个 PID，一个线程组里的所有线程都能收到。同一线程组中的线程有相同的线程组号（Thread Group ID），线程组组号放在 task_struct 的 tgid 成员变量中，一般是线程组里的第一个轻量级进程的 PID。特别需要注意 getpid() 系统调用返回的就是 tgid 的值，而不是 pid 值，这样一个多线程程序的所有线程可以共享一个 PID。

对每个进程，kernel 在通过 slab 分配器分配 task_struct 时，通常是实际分配了两个连续的物理页面（8KB），以 thread_union 联合表示，其中包括一个 thread_info 结构（其 task 成员是指向 task_struct 的指针）以及 kernel 模式的进程堆栈。esp CPU 堆栈指针即表示此进程堆栈的栈顶地址，进程从用户模式切换到 kernel 模式时，kernel 堆栈会被清空。为了效率考虑，kernel 会将这两个连续的物理页面的第一个页面按 2^13（也就是 8KB）对齐，为了避免内存较少时产生问题，kernel 提供配置选项（就是下面的 THREAD_SIZE 了）可以将 thread_info 和堆栈包含在一个页面也就是 4KB 的内存区域里。一般来说，8KB 的堆栈对于内核程序已经够用。

看看 Linux 2.6.34 中 thread_union 的定义：

union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

由于 thread_info 和内核堆栈是合并在连续的页面里的，kernel 就可以从 esp 指针得到 thread_info 结构地址，这是通过 current_thread_info 函数来实现的。

/* how to get the current stack pointer from C */
register unsigned long current_stack_pointer asm("esp") __used;

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
	return (struct thread_info *)
		(current_stack_pointer & ~(THREAD_SIZE - 1));
}

假设 thread_union 是 8KB 大小，也即 2^13，将 esp 的最低 13 位屏蔽掉即可得到 thread_info 的地址，如果是 4KB 的栈大小，屏蔽掉最低 12 位即可（和上面的代码一致），这样通过 current_thread_info()->task 就能得到当前的 task_struct，这就是 current 宏的实现了。

#define get_current() (current_thread_info()->task)
#define current get_current()

系统中进程的列表保存在 init_task 所在的双向链表中，task_struct 的 tasks 字段就是 list_head，init_task 表示的就是 PID 为 0 的 swapper 进程（或者叫 idle 进程），其 tasks 会依次指向下一个 task_struct，PID 为 1 的进程就是 init 进程，这两个进程都由 kernel 来创建。

而关于可以运行的进程的调度，Linux 2.6.34 和 ULK 上说的已经有很大的不同了。2.6.34 上加上了 struct sched_class 结构体表示不同类型的调度算法类，目前 2.6.34 上实现了三种：Completely Fair Scheduling (CFS) Class（完全公平算法，见 kernel/sched_fair.c）、Real-Time Scheduling Class（实时算法，见 kernel/sched_rt.c）和 idle-task scheduling class（见 kernel/sched_idletask.c），这三个源文件都被 include 在 kernel/sched.c 中进行编译了。CFS Class 使用 sched_entity 结构作为调度实体，其中包含权重、运行时间等信息，比 RT Class 复杂，其中还有专门的红黑树。RT Class 使用 sched_rt_entity 作为调度实体。

每个 task_struct 中都包含了 sched_entity 和 sched_rt_entity 这两个字段，sched_class 中则有 enqueue_task、dequeue_task 等函数指针指向对应调度算法中的实现函数，enqueue_task 将进程加入运行队列，dequeue_task 将进程从队列中移除，由于这段变化较大而且比较复杂，有关这三种调度算法的具体实现以后再来介绍了。

task_struct 的 real_parent 字段指向创建该进程的进程（如果父进程已不存在则为 init 进程），parent 指向当前进程的父进程，children 为该进程子进程列表，sibling 为该进程的兄弟进程列表，group_leader 字段指向该进程的线程组长。与 ULK 不同的是，ULK 中 ptrace_children 为被调试器 trace 的该进程的子进程列表，2.6.34 中 ptraced 字段包含该进程原本的子进程和 ptrace attach 的目标进程，ptrace_list 改为 ptrace_entry。另外 2.6.34 kernel 中已经引入 namespace 的概念，获得进程组 ID 和会话期 ID 的方式也于 ULK 中的有不少区别。

kernel 中进程的 PID 散列表存在 pid_hash 中以加快根据 PID 搜索 task_struct 的速度，pidhash_init 函数初始化此 PID 散列表，由于 2.6.34 中已有 namespace，pid_hashfn 也由原来的一个参数变为两个参数（增加一个 ns 参数表示哪个 namespace）。Linux kernel 也增加了 pid 和 upid 两个结构体，pid 是内核对进程 PID 的内部表示（惟一的），upid 是进程在特定的 namespace 中看到的 PID。

2、进程创建：

Linux 中使用 fork() 函数创建新进程，父进程的地址空间会复制给子进程，为了效率考虑，这个复制通过 COW（Copy-on-write）来实现，真正有写操作时才会复制。fork()、vfork()、__clone() 函数都是通过 clone() 系统调用来实现的，clone() 系统调用最终调用 do_fork()。需要注意的是 vfork() 的结果是子进程完全运行在父进程的地址空间上，父进程的页表项并不会被拷贝，而且子进程优先运行，父进程会一直阻塞直至子进程结束（调用 exec 执行新程序或者 _exit 退出，不可调用 exit 退出）。do_fork 函数定义在 kernel/fork.c 文件中，do_fork() 会再调用 copy_process() 函数。

可以看到 copy_process 是相当长的一个函数：

static struct task_struct *copy_process(unsigned long clone_flags,
					unsigned long stack_start,
					struct pt_regs *regs,
					unsigned long stack_size,
					int __user *child_tidptr,
					struct pid *pid,
					int trace)
{
	int retval;
	struct task_struct *p;
	int cgroup_callbacks_done = 0;

	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
		return ERR_PTR(-EINVAL);

	/*
	 * Thread groups must share signals as well, and detached threads
	 * can only be started up within the thread group.
	 */
	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
		return ERR_PTR(-EINVAL);

	/*
	 * Shared signal handlers imply shared VM. By way of the above,
	 * thread groups also imply shared VM. Blocking this case allows
	 * for various simplifications in other code.
	 */
	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
		return ERR_PTR(-EINVAL);

	/*
	 * Siblings of global init remain as zombies on exit since they are
	 * not reaped by their parent (swapper). To solve this and to avoid
	 * multi-rooted process trees, prevent global and container-inits
	 * from creating siblings.
	 */
	if ((clone_flags & CLONE_PARENT) &&
				current->signal->flags & SIGNAL_UNKILLABLE)
		return ERR_PTR(-EINVAL);

	retval = security_task_create(clone_flags);
	if (retval)
		goto fork_out;

	retval = -ENOMEM;
	p = dup_task_struct(current);
	if (!p)
		goto fork_out;

	ftrace_graph_init_task(p);

	rt_mutex_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
	retval = -EAGAIN;
	if (atomic_read(&p->real_cred->user->processes) >=
			task_rlimit(p, RLIMIT_NPROC)) {
		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
		    p->real_cred->user != INIT_USER)
			goto bad_fork_free;
	}

	retval = copy_creds(p, clone_flags);
	if (retval < 0)
		goto bad_fork_free;

	/*
	 * If multiple threads are within copy_process(), then this check
	 * triggers too late. This doesn't hurt, the check is only there
	 * to stop root fork bombs.
	 */
	retval = -EAGAIN;
	if (nr_threads >= max_threads)
		goto bad_fork_cleanup_count;

	if (!try_module_get(task_thread_info(p)->exec_domain->module))
		goto bad_fork_cleanup_count;

	p->did_exec = 0;
	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
	copy_flags(clone_flags, p);
	INIT_LIST_HEAD(&p->children);
	INIT_LIST_HEAD(&p->sibling);
	rcu_copy_process(p);
	p->vfork_done = NULL;
	spin_lock_init(&p->alloc_lock);

	init_sigpending(&p->pending);

	p->utime = cputime_zero;
	p->stime = cputime_zero;
	p->gtime = cputime_zero;
	p->utimescaled = cputime_zero;
	p->stimescaled = cputime_zero;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
	p->prev_utime = cputime_zero;
	p->prev_stime = cputime_zero;
#endif
#if defined(SPLIT_RSS_COUNTING)
	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
#endif

	p->default_timer_slack_ns = current->timer_slack_ns;

	task_io_accounting_init(&p->ioac);
	acct_clear_integrals(p);

	posix_cpu_timers_init(p);

	p->lock_depth = -1;		/* -1 = no lock */
	do_posix_clock_monotonic_gettime(&p->start_time);
	p->real_start_time = p->start_time;
	monotonic_to_bootbased(&p->real_start_time);
	p->io_context = NULL;
	p->audit_context = NULL;
	cgroup_fork(p);
#ifdef CONFIG_NUMA
	p->mempolicy = mpol_dup(p->mempolicy);
 	if (IS_ERR(p->mempolicy)) {
 		retval = PTR_ERR(p->mempolicy);
 		p->mempolicy = NULL;
 		goto bad_fork_cleanup_cgroup;
 	}
	mpol_fix_fork_child_flag(p);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
	p->irq_events = 0;
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
	p->hardirqs_enabled = 1;
#else
	p->hardirqs_enabled = 0;
#endif
	p->hardirq_enable_ip = 0;
	p->hardirq_enable_event = 0;
	p->hardirq_disable_ip = _THIS_IP_;
	p->hardirq_disable_event = 0;
	p->softirqs_enabled = 1;
	p->softirq_enable_ip = _THIS_IP_;
	p->softirq_enable_event = 0;
	p->softirq_disable_ip = 0;
	p->softirq_disable_event = 0;
	p->hardirq_context = 0;
	p->softirq_context = 0;
#endif
#ifdef CONFIG_LOCKDEP
	p->lockdep_depth = 0; /* no locks held yet */
	p->curr_chain_key = 0;
	p->lockdep_recursion = 0;
#endif

#ifdef CONFIG_DEBUG_MUTEXES
	p->blocked_on = NULL; /* not blocked yet */
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	p->memcg_batch.do_batch = 0;
	p->memcg_batch.memcg = NULL;
#endif

	p->bts = NULL;

	/* Perform scheduler related setup. Assign this task to a CPU. */
	sched_fork(p, clone_flags);

	retval = perf_event_init_task(p);
	if (retval)
		goto bad_fork_cleanup_policy;

	if ((retval = audit_alloc(p)))
		goto bad_fork_cleanup_policy;
	/* copy all the process information */
	if ((retval = copy_semundo(clone_flags, p)))
		goto bad_fork_cleanup_audit;
	if ((retval = copy_files(clone_flags, p)))
		goto bad_fork_cleanup_semundo;
	if ((retval = copy_fs(clone_flags, p)))
		goto bad_fork_cleanup_files;
	if ((retval = copy_sighand(clone_flags, p)))
		goto bad_fork_cleanup_fs;
	if ((retval = copy_signal(clone_flags, p)))
		goto bad_fork_cleanup_sighand;
	if ((retval = copy_mm(clone_flags, p)))
		goto bad_fork_cleanup_signal;
	if ((retval = copy_namespaces(clone_flags, p)))
		goto bad_fork_cleanup_mm;
	if ((retval = copy_io(clone_flags, p)))
		goto bad_fork_cleanup_namespaces;
	retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);
	if (retval)
		goto bad_fork_cleanup_io;

	if (pid != &init_struct_pid) {
		retval = -ENOMEM;
		pid = alloc_pid(p->nsproxy->pid_ns);
		if (!pid)
			goto bad_fork_cleanup_io;

		if (clone_flags & CLONE_NEWPID) {
			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
			if (retval < 0)
				goto bad_fork_free_pid;
		}
	}

	p->pid = pid_nr(pid);
	p->tgid = p->pid;
	if (clone_flags & CLONE_THREAD)
		p->tgid = current->tgid;

	if (current->nsproxy != p->nsproxy) {
		retval = ns_cgroup_clone(p, pid);
		if (retval)
			goto bad_fork_free_pid;
	}

	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
	/*
	 * Clear TID on mm_release()?
	 */
	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr: NULL;
#ifdef CONFIG_FUTEX
	p->robust_list = NULL;
#ifdef CONFIG_COMPAT
	p->compat_robust_list = NULL;
#endif
	INIT_LIST_HEAD(&p->pi_state_list);
	p->pi_state_cache = NULL;
#endif
	/*
	 * sigaltstack should be cleared when sharing the same VM
	 */
	if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM)
		p->sas_ss_sp = p->sas_ss_size = 0;

	/*
	 * Syscall tracing and stepping should be turned off in the
	 * child regardless of CLONE_PTRACE.
	 */
	user_disable_single_step(p);
	clear_tsk_thread_flag(p, TIF_SYSCALL_TRACE);
#ifdef TIF_SYSCALL_EMU
	clear_tsk_thread_flag(p, TIF_SYSCALL_EMU);
#endif
	clear_all_latency_tracing(p);

	/* ok, now we should be set up.. */
	p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
	p->pdeath_signal = 0;
	p->exit_state = 0;

	/*
	 * Ok, make it visible to the rest of the system.
	 * We dont wake it up yet.
	 */
	p->group_leader = p;
	INIT_LIST_HEAD(&p->thread_group);

	/* Now that the task is set up, run cgroup callbacks if
	 * necessary. We need to run them before the task is visible
	 * on the tasklist. */
	cgroup_fork_callbacks(p);
	cgroup_callbacks_done = 1;

	/* Need tasklist lock for parent etc handling! */
	write_lock_irq(&tasklist_lock);

	/* CLONE_PARENT re-uses the old parent */
	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
		p->real_parent = current->real_parent;
		p->parent_exec_id = current->parent_exec_id;
	} else {
		p->real_parent = current;
		p->parent_exec_id = current->self_exec_id;
	}

	spin_lock(&current->sighand->siglock);

	/*
	 * Process group and session signals need to be delivered to just the
	 * parent before the fork or both the parent and the child after the
	 * fork. Restart if a signal comes in before we add the new process to
	 * it's process group.
	 * A fatal signal pending means that current will exit, so the new
	 * thread can't slip out of an OOM kill (or normal SIGKILL).
 	 */
	recalc_sigpending();
	if (signal_pending(current)) {
		spin_unlock(&current->sighand->siglock);
		write_unlock_irq(&tasklist_lock);
		retval = -ERESTARTNOINTR;
		goto bad_fork_free_pid;
	}

	if (clone_flags & CLONE_THREAD) {
		atomic_inc(&current->signal->count);
		atomic_inc(&current->signal->live);
		p->group_leader = current->group_leader;
		list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
	}

	if (likely(p->pid)) {
		tracehook_finish_clone(p, clone_flags, trace);

		if (thread_group_leader(p)) {
			if (clone_flags & CLONE_NEWPID)
				p->nsproxy->pid_ns->child_reaper = p;

			p->signal->leader_pid = pid;
			tty_kref_put(p->signal->tty);
			p->signal->tty = tty_kref_get(current->signal->tty);
			attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
			attach_pid(p, PIDTYPE_SID, task_session(current));
			list_add_tail(&p->sibling, &p->real_parent->children);
			list_add_tail_rcu(&p->tasks, &init_task.tasks);
			__get_cpu_var(process_counts)++;
		}
		attach_pid(p, PIDTYPE_PID, pid);
		nr_threads++;
	}

	total_forks++;
	spin_unlock(&current->sighand->siglock);
	write_unlock_irq(&tasklist_lock);
	proc_fork_connector(p);
	cgroup_post_fork(p);
	perf_event_fork(p);
	return p;

bad_fork_free_pid:
	if (pid != &init_struct_pid)
		free_pid(pid);
bad_fork_cleanup_io:
	if (p->io_context)
		exit_io_context(p);
bad_fork_cleanup_namespaces:
	exit_task_namespaces(p);
bad_fork_cleanup_mm:
	if (p->mm)
		mmput(p->mm);
bad_fork_cleanup_signal:
	if (!(clone_flags & CLONE_THREAD))
		__cleanup_signal(p->signal);
bad_fork_cleanup_sighand:
	__cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:
	exit_fs(p); /* blocking */
bad_fork_cleanup_files:
	exit_files(p); /* blocking */
bad_fork_cleanup_semundo:
	exit_sem(p);
bad_fork_cleanup_audit:
	audit_free(p);
bad_fork_cleanup_policy:
	perf_event_free_task(p);
#ifdef CONFIG_NUMA
	mpol_put(p->mempolicy);
bad_fork_cleanup_cgroup:
#endif
	cgroup_exit(p, cgroup_callbacks_done);
	delayacct_tsk_free(p);
	module_put(task_thread_info(p)->exec_domain->module);
bad_fork_cleanup_count:
	atomic_dec(&p->cred->user->processes);
	exit_creds(p);
bad_fork_free:
	free_task(p);
fork_out:
	return ERR_PTR(retval);
}

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      struct pt_regs *regs,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr)
{
	struct task_struct *p;
	int trace = 0;
	long nr;

	/*
	 * Do some preliminary argument and permissions checking before we
	 * actually start allocating stuff
	 */
	if (clone_flags & CLONE_NEWUSER) {
		if (clone_flags & CLONE_THREAD)
			return -EINVAL;
		/* hopefully this check will go away when userns support is
		 * complete
		 */
		if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
				!capable(CAP_SETGID))
			return -EPERM;
	}

	/*
	 * We hope to recycle these flags after 2.6.26
	 */
	if (unlikely(clone_flags & CLONE_STOPPED)) {
		static int __read_mostly count = 100;

		if (count > 0 && printk_ratelimit()) {
			char comm[TASK_COMM_LEN];

			count--;
			printk(KERN_INFO "fork(): process `%s' used deprecated "
					"clone flags 0x%lx\n",
				get_task_comm(comm, current),
				clone_flags & CLONE_STOPPED);
		}
	}

	/*
	 * When called from kernel_thread, don't do user tracing stuff.
	 */
	if (likely(user_mode(regs)))
		trace = tracehook_prepare_clone(clone_flags);

	p = copy_process(clone_flags, stack_start, regs, stack_size,
			 child_tidptr, NULL, trace);
	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	if (!IS_ERR(p)) {
		struct completion vfork;

		trace_sched_process_fork(current, p);

		nr = task_pid_vnr(p);

		if (clone_flags & CLONE_PARENT_SETTID)
			put_user(nr, parent_tidptr);

		if (clone_flags & CLONE_VFORK) {
			p->vfork_done = &vfork;
			init_completion(&vfork);
		}

		audit_finish_fork(p);
		tracehook_report_clone(regs, clone_flags, nr, p);

		/*
		 * We set PF_STARTING at creation in case tracing wants to
		 * use this to distinguish a fully live task from one that
		 * hasn't gotten to tracehook_report_clone() yet.  Now we
		 * clear it and set the child going.
		 */
		p->flags &= ~PF_STARTING;

		if (unlikely(clone_flags & CLONE_STOPPED)) {
			/*
			 * We'll start up with an immediate SIGSTOP.
			 */
			sigaddset(&p->pending.signal, SIGSTOP);
			set_tsk_thread_flag(p, TIF_SIGPENDING);
			__set_task_state(p, TASK_STOPPED);
		} else {
			wake_up_new_task(p, clone_flags);
		}

		tracehook_report_clone_complete(trace, regs,
						clone_flags, nr, p);

		if (clone_flags & CLONE_VFORK) {
			freezer_do_not_count();
			wait_for_completion(&vfork);
			freezer_count();
			tracehook_report_vfork_done(p, nr);
		}
	} else {
		nr = PTR_ERR(p);
	}
	return nr;
}

copy_process 中会调用 dup_task_struct 先复制 task_struct 进程描述符，并做一些必要的改动，默认会去掉 flags 字段也即描述符标志中的PF_SUPERPRIV 值，表示进程没有使用超级用户权限，默认设置了PF_FORKNOEXEC 表示还没有执行 exec，设置了 PF_STARTING 标志表示进程正在被创建。copy_process 还会调用 copy_creds 复制进程凭证，这个以后再来专门研究，调用 task_io_accounting_init 初始化 I/O 统计，调用 cgroup_fork 将此进程加到父 cgroups 中，cgroups 是新 kernel 加入的一个比较重要的分组控制的机制，也是依赖 namespace 来实现的，有关 cgroups 以后将专门写一篇文章来介绍。

copy_process 然后会调用 sched_fork 为新创建的进程配置调度器，其中会将进程状态设为 TASK_WAKING 保证没人能运行它或者用信号之类将此新进程加入运行队列，并会调用 set_task_cpu 为进程选择一个空闲的 CPU 来运行，sched_fork 相关函数在 kernel/sched.c 中。

copy_process 然后会根据 clone_flags 来调用 copy_files、copy_fs 等来复制文件描述符、文件系统信息、信号处理函数等，copy_process 中调用 alloc_pid 给新进程分配 PID，copy_process 最终返回一个新的 task_struct。do_fork 最终会通过 task_pid_vnr 来返回子进程的 PID。

fork() 之后 Linux kernel 主观上会调用 wake_up_new_task 让子进程先运行，这样避免父进程有写地址空间的改动而可以减少 COW 的次数，但实际运行中并不一定完全会这样。

当使用 vfork() 函数创建进程时，task_struct 的 vfork_done 指向一个特定的地址，kernel 通过 init_completion 初始化等待，从上面的代码也可以看到父进程会一直调用 wait_for_completion 等待直到子进程通过 vfork_done 指针通知它，每个进程退出地址空间时（exec 或者 exit） mm_release() 函数中会检查 vfork_done 如果不为 NULL 父进程就能收到信号，这样就可以保证 vfork() 时必须是子进程先运行。

实际上由于 COW 的存在，vfork() 相对 fork() 惟一的好处只在于少了页表项的拷贝，现在也已经有了测试版的 COW 页表项 patch，这个就以后再抽空测试了。

3、Linux线程实现：

Linux kernel 的线程实现是比较特别的，它没有专门的线程概念，所有线程只是标准的进程，线程只是一个与其它进程共享资源的进程，这与 Windows、Solaris 等操作系统有明显的不同。

Linux 中线程的创建也是通过 clone() 系统调用来实现，只是增加了特别的参数：

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

上面的结果是地址空间、文件系统资源、文件描述符、信号处理都被共享了。

普通的 fork() 的调用方式为：

clone(SIGCHLD, 0);

vfork() 函数的调用方式为：

clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);

这样明显就能看到区别了。

另外来专门说下 kernel thread，kernel thread 是一种只在内核空间存在的进程，它不会发生上下文切换到用户空间，它也是可被调度和可被抢占的，它与普通进程的区别在于它没有地址空间，它的 task_struct 的 mm 指针（即进程地址空间）为 NULL。常见的 kernel thread 有 flush、ksoftirqd 等。

可以使用 kthread_create 函数创建 kernel thread，此函数定义在 kernel/kthread.c 中，此函数即返回 task_struct 指针，kernel thread 创建之后默认为 TASK_INTERRUPTIBLE 状态，需要使用 wake_up_process 来唤醒它，因此可以使用 kthread_run 直接创建并运行一个 kernel thread。kthread_stop 用于停止 kernel thread 执行。

看看 kthread_create 的代码：

struct task_struct *kthread_create(int (*threadfn)(void *data),
				   void *data,
				   const char namefmt[],
				   ...)
{
	struct kthread_create_info create;

	create.threadfn = threadfn;
	create.data = data;
	init_completion(&create.done);

	spin_lock(&kthread_create_lock);
	list_add_tail(&create.list, &kthread_create_list);
	spin_unlock(&kthread_create_lock);

	wake_up_process(kthreadd_task);
	wait_for_completion(&create.done);

	if (!IS_ERR(create.result)) {
		struct sched_param param = { .sched_priority = 0 };
		va_list args;

		va_start(args, namefmt);
		vsnprintf(create.result->comm, sizeof(create.result->comm),
			  namefmt, args);
		va_end(args);
		/*
		 * root may have changed our (kthreadd's) priority or CPU mask.
		 * The kernel thread should not inherit these properties.
		 */
		sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
		set_cpus_allowed_ptr(create.result, cpu_all_mask);
	}
	return create.result;
}

这时就要请出 PID 为 2 的 kthreadd 进程出场了，kernel thread 的创建是由 kthreadd 完成的。kthread_create 中会先把创建的信息 kthread_create_info 加到 kthread_create_list 链接中，然后唤醒 kthreadd 进程（其进程描述符保存在 kthreadd_task 全局变量中），并使用 wait_for_completion 等待 kthreadd 创建过程完成。

再看看 kthreadd 的实现：

int kthreadd(void *unused)
{
	struct task_struct *tsk = current;

	/* Setup a clean context for our children to inherit. */
	set_task_comm(tsk, "kthreadd");
	ignore_signals(tsk);
	set_cpus_allowed_ptr(tsk, cpu_all_mask);
	set_mems_allowed(node_states[N_HIGH_MEMORY]);

	current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (list_empty(&kthread_create_list))
			schedule();
		__set_current_state(TASK_RUNNING);

		spin_lock(&kthread_create_lock);
		while (!list_empty(&kthread_create_list)) {
			struct kthread_create_info *create;

			create = list_entry(kthread_create_list.next,
					    struct kthread_create_info, list);
			list_del_init(&create->list);
			spin_unlock(&kthread_create_lock);

			create_kthread(create);

			spin_lock(&kthread_create_lock);
		}
		spin_unlock(&kthread_create_lock);
	}

	return 0;
}

kthreadd 在循环中检查 kthread_create_list 链表，会找到刚才的 kthread_create_info 结构，并将之从 kthread_create_list 链表中删除，然后调用 create_kthread 最终完成创建。

OK，既然都分析这么多了就再来 create_kthread：

static int kthread(void *_create)
{
	/* Copy data: it's on kthread's stack */
	struct kthread_create_info *create = _create;
	int (*threadfn)(void *data) = create->threadfn;
	void *data = create->data;
	struct kthread self;
	int ret;

	self.should_stop = 0;
	init_completion(&self.exited);
	current->vfork_done = &self.exited;

	/* OK, tell user we're spawned, wait for stop or wakeup */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	create->result = current;
	complete(&create->done);
	schedule();

	ret = -EINTR;
	if (!self.should_stop)
		ret = threadfn(data);

	/* we can't just return, we must preserve "self" on stack */
	do_exit(ret);
}

static void create_kthread(struct kthread_create_info *create)
{
	int pid;

	/* We want our own signal handler (we take no signals by default). */
	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
	if (pid < 0) {
		create->result = ERR_PTR(pid);
		complete(&create->done);
	}
}

create_kthread 会调用 kernel_thread 创建新的进程，而且它的入口函数是 kthread 函数，参数为 create。kthread 中把自己设为 TASK_UNINTERRUPTIBLE 状态，并用 complete 告诉 kthread_create 创建好了，接着它调用 schedule() 函数使其所在进程进入睡眠状态，如果被唤醒（可以使用 wait_up_process 之类的了）就执行创建 kthread thread 时指定的入口函数。

而 kernel_thread 的实现则是平台相关的，它会调用 do_fork 创建新的进程，看看 x86 上的实现：

int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
	struct pt_regs regs;

	memset(&regs, 0, sizeof(regs));

	regs.si = (unsigned long) fn;
	regs.di = (unsigned long) arg;

#ifdef CONFIG_X86_32
	regs.ds = __USER_DS;
	regs.es = __USER_DS;
	regs.fs = __KERNEL_PERCPU;
	regs.gs = __KERNEL_STACK_CANARY;
#else
	regs.ss = __KERNEL_DS;
#endif

	regs.orig_ax = -1;
	regs.ip = (unsigned long) kernel_thread_helper;
	regs.cs = __KERNEL_CS | get_kernel_rpl();
	regs.flags = X86_EFLAGS_IF | 0x2;

	/* Ok, create the new process.. */
	return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
}

由此也可以看到在 kernel 中创建 kernel thread 有两种方法：kthread_create 和 kernel_thread，它们的区别是：kthread_create 创建的 kernel thread 有完整的上下文环境，其父进程一定为 kthreadd，而 kernel_thread 的父进程可以为 init 或其它 kernel thread。

4、进程结束：

进程可以使用 exit 函数主动结束自己，也可以在收到信号或异常时非主动结束。不管怎样结束，最终都由 do_exit() 函数来完成，此函数定义在 kernel/exit.c 中，看看它的实现：

static void exit_notify(struct task_struct *tsk, int group_dead)
{
	int signal;
	void *cookie;

	/*
	 * This does two things:
	 *
  	 * A.  Make init inherit all the child processes
	 * B.  Check to see if any process groups have become orphaned
	 *	as a result of our exiting, and if they have any stopped
	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
	 */
	forget_original_parent(tsk);
	exit_task_namespaces(tsk);

	write_lock_irq(&tasklist_lock);
	if (group_dead)
		kill_orphaned_pgrp(tsk->group_leader, NULL);

	/* Let father know we died
	 *
	 * Thread signals are configurable, but you aren't going to use
	 * that to send signals to arbitary processes.
	 * That stops right now.
	 *
	 * If the parent exec id doesn't match the exec id we saved
	 * when we started then we know the parent has changed security
	 * domain.
	 *
	 * If our self_exec id doesn't match our parent_exec_id then
	 * we have changed execution domain as these two values started
	 * the same after a fork.
	 */
	if (tsk->exit_signal != SIGCHLD && !task_detached(tsk) &&
	    (tsk->parent_exec_id != tsk->real_parent->self_exec_id ||
	     tsk->self_exec_id != tsk->parent_exec_id))
		tsk->exit_signal = SIGCHLD;

	signal = tracehook_notify_death(tsk, &cookie, group_dead);
	if (signal >= 0)
		signal = do_notify_parent(tsk, signal);

	tsk->exit_state = signal == DEATH_REAP ? EXIT_DEAD : EXIT_ZOMBIE;

	/* mt-exec, de_thread() is waiting for us */
	if (thread_group_leader(tsk) &&
	    tsk->signal->group_exit_task &&
	    tsk->signal->notify_count < 0)
		wake_up_process(tsk->signal->group_exit_task);

	write_unlock_irq(&tasklist_lock);

	tracehook_report_death(tsk, signal, cookie, group_dead);

	/* If the process is dead, release it - nobody will wait for it */
	if (signal == DEATH_REAP)
		release_task(tsk);
}

NORET_TYPE void do_exit(long code)
{
	struct task_struct *tsk = current;
	int group_dead;

	profile_task_exit(tsk);

	WARN_ON(atomic_read(&tsk->fs_excl));

	if (unlikely(in_interrupt()))
		panic("Aiee, killing interrupt handler!");
	if (unlikely(!tsk->pid))
		panic("Attempted to kill the idle task!");

	tracehook_report_exit(&code);

	validate_creds_for_do_exit(tsk);

	/*
	 * We're taking recursive faults here in do_exit. Safest is to just
	 * leave this task alone and wait for reboot.
	 */
	if (unlikely(tsk->flags & PF_EXITING)) {
		printk(KERN_ALERT
			"Fixing recursive fault but reboot is needed!\n");
		/*
		 * We can do this unlocked here. The futex code uses
		 * this flag just to verify whether the pi state
		 * cleanup has been done or not. In the worst case it
		 * loops once more. We pretend that the cleanup was
		 * done as there is no way to return. Either the
		 * OWNER_DIED bit is set by now or we push the blocked
		 * task into the wait for ever nirwana as well.
		 */
		tsk->flags |= PF_EXITPIDONE;
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule();
	}

	exit_irq_thread();

	exit_signals(tsk);  /* sets PF_EXITING */
	/*
	 * tsk->flags are checked in the futex code to protect against
	 * an exiting task cleaning up the robust pi futexes.
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);

	if (unlikely(in_atomic()))
		printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
				current->comm, task_pid_nr(current),
				preempt_count());

	acct_update_integrals(tsk);
	/* sync mm's RSS info before statistics gathering */
	if (tsk->mm)
		sync_mm_rss(tsk, tsk->mm);
	group_dead = atomic_dec_and_test(&tsk->signal->live);
	if (group_dead) {
		hrtimer_cancel(&tsk->signal->real_timer);
		exit_itimers(tsk->signal);
		if (tsk->mm)
			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
	}
	acct_collect(code, group_dead);
	if (group_dead)
		tty_audit_exit();
	if (unlikely(tsk->audit_context))
		audit_free(tsk);

	tsk->exit_code = code;
	taskstats_exit(tsk, group_dead);

	exit_mm(tsk);

	if (group_dead)
		acct_process();
	trace_sched_process_exit(tsk);

	exit_sem(tsk);
	exit_files(tsk);
	exit_fs(tsk);
	check_stack_usage();
	exit_thread();
	cgroup_exit(tsk, 1);

	if (group_dead)
		disassociate_ctty(1);

	module_put(task_thread_info(tsk)->exec_domain->module);

	proc_exit_connector(tsk);

	/*
	 * FIXME: do that only when needed, using sched_exit tracepoint
	 */
	flush_ptrace_hw_breakpoint(tsk);
	/*
	 * Flush inherited counters to the parent - before the parent
	 * gets woken up by child-exit notifications.
	 */
	perf_event_exit_task(tsk);

	exit_notify(tsk, group_dead);
#ifdef CONFIG_NUMA
	mpol_put(tsk->mempolicy);
	tsk->mempolicy = NULL;
#endif
#ifdef CONFIG_FUTEX
	if (unlikely(current->pi_state_cache))
		kfree(current->pi_state_cache);
#endif
	/*
	 * Make sure we are holding no locks:
	 */
	debug_check_no_locks_held(tsk);
	/*
	 * We can do this unlocked here. The futex code uses this flag
	 * just to verify whether the pi state cleanup has been done
	 * or not. In the worst case it loops once more.
	 */
	tsk->flags |= PF_EXITPIDONE;

	if (tsk->io_context)
		exit_io_context(tsk);

	if (tsk->splice_pipe)
		__free_pipe_info(tsk->splice_pipe);

	validate_creds_for_do_exit(tsk);

	preempt_disable();
	exit_rcu();
	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;
	schedule();
	BUG();
	/* Avoid "noreturn function does return".  */
	for (;;)
		cpu_relax();	/* For when BUG is null */
}

do_exit 会将 task_struct 的 flags 字段设为 PF_EXITING（在 exit_signals 中设置），调用 acct_update_integrals 更新进程统计信息，调用 exit_mm 释放进程使用的地址空间（mm_struct），调用 exit_sem、exit_files、exit_fs 释放等待的信号量，递减文件描述符和文件系统的引用计数，task_struct 的 exit_code 被设置为对应的退出值，调用 exit_notify 通知进程的父进程，而在 exit_notify 函数中将此进程的子进程的父进程改为线程组中的另一个线程或者 init 进程，并将 exit_state 标记为 EXIT_ZOMBIE（需要父进程得到退出状态了），最终进程 state 状态被设置为 TASK_DEAD，然后调用 schedule() 以切换到新的进程上。

do_exit 完成之后，进程的描述符和 thread_info 都还是存在的，只是进程是 EXIT_ZOMBIE 状态而且不可运行，如果父进程通过 wait4 等系统调用处理完此进程的状态，此进程的 task_struct 和 thread_info 就会通过调用 release_task() 被最终释放，看看它的实现：

void release_task(struct task_struct * p)
{
	struct task_struct *leader;
	int zap_leader;
repeat:
	tracehook_prepare_release_task(p);
	/* don't need to get the RCU readlock here - the process is dead and
	 * can't be modifying its own credentials. But shut RCU-lockdep up */
	rcu_read_lock();
	atomic_dec(&__task_cred(p)->user->processes);
	rcu_read_unlock();

	proc_flush_task(p);

	write_lock_irq(&tasklist_lock);
	tracehook_finish_release_task(p);
	__exit_signal(p);

	/*
	 * If we are the last non-leader member of the thread
	 * group, and the leader is zombie, then notify the
	 * group leader's parent process. (if it wants notification.)
	 */
	zap_leader = 0;
	leader = p->group_leader;
	if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) {
		BUG_ON(task_detached(leader));
		do_notify_parent(leader, leader->exit_signal);
		/*
		 * If we were the last child thread and the leader has
		 * exited already, and the leader's parent ignores SIGCHLD,
		 * then we are the one who should release the leader.
		 *
		 * do_notify_parent() will have marked it self-reaping in
		 * that case.
		 */
		zap_leader = task_detached(leader);

		/*
		 * This maintains the invariant that release_task()
		 * only runs on a task in EXIT_DEAD, just for sanity.
		 */
		if (zap_leader)
			leader->exit_state = EXIT_DEAD;
	}

	write_unlock_irq(&tasklist_lock);
	release_thread(p);
	call_rcu(&p->rcu, delayed_put_task_struct);

	p = leader;
	if (unlikely(zap_leader))
		goto repeat;
}

它先调用 __exit_signal，其中会调用 __unhash_process，__unhash_process 调用 detach_pid 将 PID 从上面说的进程 PID hash 表（pid_hash）中删除。release_task 最终会调用 delayed_put_task_struct，其中再调用 put_task_struct，put_task_struct 再分别调用 free_thread_info 和 free_task_struct 来释放 thread_info 和 task_struct。

需要注意的是如果父进程在子进程之前退出，这时需要一个机制设置子进程的新父进程，不然子进程退出时将一直处于僵尸状态。因此进程退出时需要调用 exit_notify 进行通知处理，exit_notify 调用 forget_original_parent，其中再调用 find_new_reaper（意思不错，找到新收割者 ^_^）来设置新父进程，看看它的实现：

static struct task_struct *find_new_reaper(struct task_struct *father)
{
	struct pid_namespace *pid_ns = task_active_pid_ns(father);
	struct task_struct *thread;

	thread = father;
	while_each_thread(father, thread) {
		if (thread->flags & PF_EXITING)
			continue;
		if (unlikely(pid_ns->child_reaper == father))
			pid_ns->child_reaper = thread;
		return thread;
	}

	if (unlikely(pid_ns->child_reaper == father)) {
		write_unlock_irq(&tasklist_lock);
		if (unlikely(pid_ns == &init_pid_ns))
			panic("Attempted to kill init!");

		zap_pid_ns_processes(pid_ns);
		write_lock_irq(&tasklist_lock);
		/*
		 * We can not clear ->child_reaper or leave it alone.
		 * There may by stealth EXIT_DEAD tasks on ->children,
		 * forget_original_parent() must move them somewhere.
		 */
		pid_ns->child_reaper = init_pid_ns.child_reaper;
	}

	return pid_ns->child_reaper;
}

它先用 while_each_thread 在线程组中找一个进程，找到就直接返回，找不到就返回 init 进程了，init 进程会自动调用 wait() 来等待它的子进程。

至此 Linux kernel 中的进程基本实现大概了解了，将了进程和线程概念、进程创建和退出等做了代码上的分析介绍，下面就是专门的进程调度了，有任何问题欢迎指正哦~~~ ^_^

Posted in kernel, Linux, Technology | No comments

Previous post: Linux中拷贝目录跳过指定文件的方法
Next post: 一个比较烂的主机空间 - GigaWebHost

Leave a comment: Name (required)

Mail (required, but not published)

Website

Comment

CAPTCHA Code*

Notify me of follow-up comments via e-mail