Advanced Features
Enum
Enumerations
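A minimal sketch using the standard-library enum module; the Color members are made up for illustration:

from enum import Enum, unique

@unique                                  # reject duplicate member values
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

print(Color.RED)                         # Color.RED
print(Color.RED.name, Color.RED.value)   # RED 1
print(Color(2))                          # look up a member by value -> Color.GREEN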
Exception
Exception handling
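A minimal try/except/else/finally sketch; the division example is illustrative:

def safe_div(a, b):
    try:
        result = a / b
    except ZeroDivisionError as e:       # catch only the exception you expect
        print(f"caught: {e}")
        return None
    else:                                # runs only if no exception was raised
        return result
    finally:                             # always runs, even after return
        print("cleanup")

safe_div(1, 0)                           # prints the caught message, then "cleanup"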
File
File operations (see the sketch after this list)
- Read/write operations
- Directory operations
- Serialization
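A minimal sketch of the three kinds of operations; file and directory names are illustrative:

import os
import pickle
from pathlib import Path

# Read/write: the with-statement closes the file automatically.
with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("hello\n")
with open("notes.txt", encoding="utf-8") as f:
    print(f.read())

# Directory operations.
os.makedirs("data", exist_ok=True)
print(list(Path(".").iterdir()))

# Serialization (pickle).
with open("data/obj.pkl", "wb") as f:
    pickle.dump({"a": 1}, f)
with open("data/obj.pkl", "rb") as f:
    print(pickle.load(f))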
Concurrency
Multiprocessing (multiprocessing module)
Usually handled with multiprocessing's Pool; for compute-heavy work such as LLM generation, multiple processes are the better fit (see the sketch after this list)
Multithreading (threading / threading.local)
Usually handled with threading's Thread class, for I/O-bound work such as file reads/writes, database queries, and other long-running but lightweight disk-bound tasks
Async
Can be combined with either of the above: multithreading + async, or multiprocessing + async
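A minimal sketch contrasting the three approaches; the task functions and their parameters are illustrative assumptions:

import asyncio
import threading
from multiprocessing import Pool

def cpu_task(n):                          # CPU-bound work (illustrative)
    return sum(i * i for i in range(n))

def io_task(path):                        # I/O-bound work (illustrative)
    with open(path, "w") as f:
        f.write("done\n")

async def async_task():
    await asyncio.sleep(0.1)              # stand-in for awaitable I/O
    return "ok"

if __name__ == "__main__":                # guard required for multiprocessing
    with Pool(processes=4) as pool:       # CPU-bound: worker processes
        print(pool.map(cpu_task, [10_000, 20_000]))

    t = threading.Thread(target=io_task, args=("out.txt",))   # I/O-bound: a thread
    t.start()
    t.join()

    print(asyncio.run(async_task()))      # async: cooperative scheduling in one thread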
Regular expressions
Standard-library module: re; extended third-party package: regex
A regular expression (or RE) specifies a set of strings that matches it.
- Works on Unicode strings (str) and 8-bit strings (bytes)
- Use raw strings (r'...') to write patterns
- The third-party regex package can be used for a richer API
- A complex, large regex can be split into several smaller ones
- Matching just before a newline == matching at the end of the line (relevant for $)
Grammar
. By default, '.' does not match a newline. If the DOTALL flag has been specified, this matches any character including a newline.
$ searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
non-greedy mode
Multiline mode
* + ? are greedy by default
Non-greedy variants: *? +? ??
{m,n}? matches as few repetitions as possible: once m repetitions are found, it does not try for more
{m,n}+ matches as many repetitions as possible and, unlike {m,n}, does not backtrack to fewer (possessive quantifier)
Example
a{4,}b
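A short re demo of the points above (greedy vs. non-greedy, DOTALL, and the a{4,}b example); the sample strings are made up for illustration:

import re

text = "<a><b>"
print(re.findall(r"<.*>", text))             # greedy:     ['<a><b>']
print(re.findall(r"<.*?>", text))            # non-greedy: ['<a>', '<b>']

print(re.search(r"a.b", "a\nb"))             # None: '.' does not match a newline
print(re.search(r"a.b", "a\nb", re.DOTALL))  # matches once DOTALL is set

print(bool(re.search(r"a{4,}b", "aaaab")))   # True: at least four a's before b
print(bool(re.search(r"a{4,}b", "aaab")))    # False: only three a's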
pip
Options
-i specify a package index / mirror URL (--index-url)
-e install a local project in editable mode (--editable)
Mirror sources
Switching the mirror source
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip config unset global.index-url
If a wheel (.whl) install fails:
Try preferring a pre-built binary
pip install <pkg> --prefer-binary
Or download the source and install it with pip
OOP
Commonly used attributes / special methods
@property turns a method into an attribute (see the sketch after this list)
Mixin-style multiple inheritance
__init__
__str__
__repr__
__iter__
__getitem__
__getattr__
__call__
metaclass
__slots__ restricts which attributes instances may have (a limit on the dynamic nature of Python)
class Point:                     # illustrative class and slot names
    __slots__ = ("x", "y")
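A minimal sketch combining @property, a couple of the dunder methods above, and __slots__; the Temperature class and its fields are illustrative:

class Temperature:
    __slots__ = ("_celsius",)             # instances may only have this attribute

    def __init__(self, celsius):
        self._celsius = celsius

    @property                             # expose a computed value as an attribute
    def fahrenheit(self):
        return self._celsius * 9 / 5 + 32

    def __repr__(self):                   # unambiguous debug representation
        return f"Temperature({self._celsius!r})"

    def __call__(self):                   # make instances callable
        return self._celsius

t = Temperature(25)
print(t.fahrenheit)                       # 77.0, accessed without parentheses
print(repr(t), t())                       # Temperature(25) 25
# t.other = 1 would raise AttributeError because of __slots__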
Commonly used built-in functions
type()
isinstance()
dir()
ipython
1 ! run shell commands from IPython
!vi xxx create a Python file with vim
2 %run xxx run a Python file
3 %timeit xxx time how long a statement takes
4 %pdb toggle automatic debugging: when a line raises an exception, drop into the debugger at that frame
on
off
p <var> print a variable's value
c continue (return to IPython)
q quit the debugger
a print the current function's arguments
w full stack trace (where)
l list the source around the current line
5 %paste run code from the clipboard
6 %bookmark <name> <value> save frequently used bookmarks (e.g. directories)
-l list all bookmarks
-d <name> delete a bookmark
-r remove all bookmarks
Other:
_ the previous output (a value reference to the result); _n refers to the output of line n
__*__? wildcard search of the namespace (list names matching the pattern)
Command-history search: type a prefix (e.g. a) then press ↑
numpy
python
Use cases it can serve:
Quantitative trading:
Converting multinational companies' market caps across currencies
Computing the total price of a list of items
Core use: create multi-dimensional array objects and perform numerical operations on them; by default, calculations are carried out in floating point.
Core commands:
import numpy as np
np.array(): the resulting arrays support element-wise arithmetic directly, as below.
arr_product = [3.5, 5, 2]               # unit prices (illustrative values)
arr_num = [2, 1, 4]                     # quantities (illustrative values)
arr_product = np.array(arr_product)
arr_num = np.array(arr_num)
total = (arr_product * arr_num).sum()   # element-wise multiply, then sum
Multi-dimensional arrays:
arr.size => total number of elements: 2 * 2 * 3 = 12 for shape (2, 2, 3); 2 * 3 = 6 for shape (2, 3)
arr.shape => (2, 2, 3) is a 3-D array with 2 'pages', 2 rows, 3 columns; (2, 3) is a 2-D array with 2 rows, 3 columns
Commands:
~.dtype element type
~.size number of elements
~.shape dimensions as a tuple
~.T transpose (swap rows and columns)
~.ndim number of dimensions
np.zeros(n, dtype=...) create an array of zeros
np.ones(n, dtype=...) create an array of ones
np.empty(n) create an uninitialized array (arbitrary values)
np.arange(start, end, step) NumPy's version of range(); the step may be a float
np.linspace(start, stop, num) num evenly spaced real values from start to stop
np.eye(n) identity matrix (linear algebra)
(Array vs. list: all array elements share one dtype and the size is fixed; see the demo below)
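A short demo of the commands above (note the correct spellings: np.arange and np.linspace); the values are illustrative:

import numpy as np

a = np.arange(0, 1, 0.25)         # array([0.  , 0.25, 0.5 , 0.75]); float step allowed
b = np.linspace(0, 1, 5)          # 5 evenly spaced points from 0 to 1 inclusive
z = np.zeros(3, dtype="int64")    # array([0, 0, 0])
e = np.eye(2)                     # 2x2 identity matrix

m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.shape, m.size, m.ndim)    # (2, 3) 6 2
print(m.dtype)                    # an integer dtype (platform-dependent)
print(m.T.shape)                  # (3, 2) after transposing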
GPT study notes
Docs
- A well-written prompt provides enough information for the model to know what you want and how it should respond.
- The default models' training data cuts off in 2021.
- If all the returned samples have finish_reason == "length", it's likely that max_tokens is too small and the model runs out of tokens before it manages to connect the prompt and the suffix naturally. Consider increasing max_tokens before resampling.
Basics: there are three basic guidelines to creating prompts.
1. Show and tell. Make it clear what you want either through instructions, examples, or a combination of the two. If you want the model to rank a list of items in alphabetical order or to classify a paragraph by sentiment, show it that's what you want.
2. Provide quality data. If you're trying to build a classifier or get the model to follow a pattern, make sure there are enough examples. Be sure to proofread your examples: the model is usually smart enough to see through basic spelling mistakes and give you a response, but it might also assume the mistakes are intentional, which can affect the response.
3. Check your settings. The temperature and top_p settings control how deterministic the model is in generating a response. If you're asking for a response with only one right answer, set them lower; if you're looking for more diverse responses, set them higher. The number one mistake people make with these settings is assuming they're "cleverness" or "creativity" controls.
Key parameters
- The data you provide and its source: is there enough of it, and is the quality high enough?
- Settings: are they set correctly? Do you want the output to converge or to diverge?
- Determinism (how much the output converges or diverges): temperature, 0 is the most deterministic, higher values allow more possibilities; top_p works the same way
- Content continuity vs. truncation: max_tokens, larger allows more to be sent and returned; too small cuts the output off before finish_reason == 'stop'
- Resample 3 to 5 times (or use best_of with k = 3..5)
- endpoint: give the pattern to follow and make it clear when to stop
- Provide extra, precise information with clear priorities
- Extra Message (EM) to help GPT better understand: give the model useful context
Response format
The message you ultimately want is response['choices'][0]['message']['content']. A full response looks like:
{
  'id': 'chatcmpl-6p9XYPYSTTRi0xEviKjjilqrWU2Ve',
  'object': 'chat.completion',
  'created': 1677649420,
  'model': 'gpt-3.5-turbo',
  'usage': {'prompt_tokens': 56, 'completion_tokens': 31, 'total_tokens': 87},
  'choices': [
    {
      'message': {
        'role': 'assistant',
        'content': 'The 2020 World Series was played in Arlington, Texas at the Globe Life Field, which was the new home stadium for the Texas Rangers.'},
      'finish_reason': 'stop',
      'index': 0
    }
  ]
}
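A hedged sketch of the call that produces a response of this shape, assuming the pre-1.0 openai Python client (openai.ChatCompletion); the messages, API-key handling, and parameter values are illustrative:

import openai                      # assumes openai<1.0 (legacy client)

openai.api_key = "sk-..."          # placeholder; load from an environment variable

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where was the 2020 World Series played?"},
    ],
    temperature=0.2,               # lower -> more deterministic
    top_p=1,
    max_tokens=256,                # large enough to avoid finish_reason == 'length'
)

print(response["choices"][0]["message"]["content"])
print(response["choices"][0]["finish_reason"])      # expect 'stop'
print(response["usage"]["total_tokens"])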
Parameter details, finish_reason:
stop: API returned complete model output
length: incomplete model output due to the max_tokens parameter or the token limit
content_filter: omitted content due to a flag from the content filters
null: API response still in progress or incomplete
Token management: Language models read text in chunks called tokens. In English, a token can be as short as one character or as long as one word (e.g., a or apple), and in some languages tokens can be even shorter than one character or even longer than one word. For example, the string "ChatGPT is great!" is encoded into six tokens: ["Chat", "G", "PT", " is", " great", "!"].
More tokens cost more; more tokens also take longer, but allow the content to be matched more fully. The maximum is 4096 tokens (for gpt-3.5-turbo).
Both input and output tokens count toward these quantities. For example, if your API call used 10 tokens in the message input and you received 20 tokens in the message output, you would be billed for 30 tokens.
Check the count with response['usage']['total_tokens'].
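As a sketch, the token count can also be estimated locally with the tiktoken package (an assumption; it is not mentioned in the original notes):

import tiktoken                    # pip install tiktoken (assumed dependency)

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = enc.encode("ChatGPT is great!")
print(len(tokens))                              # expect 6, matching the example above
print([enc.decode([t]) for t in tokens])        # the individual token strings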