[Golang] Containers

Of course, "containers" here doesn't mean runtime containers in the CNI sense, but the ways Go stores and organizes data.

Read more

[Golang] Functions

Function features in Go:

  • Functions can be passed around as values
  • Anonymous functions and closures are supported
  • Functions can satisfy interfaces
Read more

[Tips] Terminal 10X Workflow

This post is reposted from legendtkl's blog post 终端10X工作法. I've always enjoyed the small tricks that come with everyday terminal use, and that article happens to collect some representative ones. Building on it, I'll record the tricks I use most often and keep them at the top for easy lookup.

Many thanks as well to legendtkl's article 合抱之木,生于毫末, which is one of the reasons I started blogging in the first place.

The original article follows.

Read more

[NOTE] ClockSynchronize

Ubuntu

  • Check your current time-zone: cat /etc/timezone
  • Check if your clock is synchronized with the internet: timedatectl
root@openstack1:~/kubernetes# timedatectl
Local time: Tue 2019-12-17 10:30:02 UTC
Universal time: Tue 2019-12-17 10:30:02 UTC
RTC time: Tue 2019-12-17 10:30:02
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: yes
RTC in local TZ: no

If your clock hasn't been synchronized with the internet, the value of System clock synchronized will be no.

Use sudo systemctl restart systemd-timesyncd.service to activate timesyncd:

root@openstack1:~/kubernetes# systemctl status systemd-timesyncd.service
● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendo
Active: active (running) since Mon 2019-12-16 09:24:02 UTC; 1 day 1h ago
Docs: man:systemd-timesyncd.service(8)
Main PID: 27584 (systemd-timesyn)
Status: "Synchronized to time server 91.189.94.4:123 (ntp.ubuntu.com)."
Tasks: 2 (limit: 4915)
CGroup: /system.slice/systemd-timesyncd.service
└─27584 /lib/systemd/systemd-timesyncd
  • The final step to synchronize the clock

sudo timedatectl set-ntp true

  • Switch your time-zone

Run timedatectl list-timezones to list all available time zones.

sudo timedatectl set-timezone Asia/Shanghai

CentOS

  • Using command line

    # installation
    yum install chrony
    # enable chronyd
    systemctl start chronyd
    systemctl enable chronyd
    # set timezone to Shanghai
    timedatectl set-timezone Asia/Shanghai
    # turn on NTP synchronization
    timedatectl set-ntp yes

[NOTE] Kaggle house price prediction

Abstract

This is a copycat of Comprehensive data exploration with Python. Since I had limited time for this AI project, I preferred to learn from someone else's notebook, and 'Comprehensive data exploration with Python' was apparently the best fit for me.

According to the article, the first thing we should do is look through the whole data set and find the most important variables, the ones that matter when you buy a house.

Then an important problem we must deal with is data cleaning.

Overview

While 'Type' and 'Segment' are just for possible future reference, the column 'Expectation' is important because it will help us develop a 'sixth sense'. To fill this column, we should read the description of all the variables and, one by one, ask ourselves:

  • Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about its ‘Masonry veneer type’?).
  • If so, how important would this variable be? (e.g. What is the impact of having ‘Excellent’ material on the exterior instead of ‘Poor’? And of having ‘Excellent’ instead of ‘Good’?).
  • Is this information already described in any other variable? (e.g. If ‘LandContour’ gives the flatness of the property, do we really need to know the ‘LandSlope’?).

I went through this process and concluded that the following variables can play an important role in this problem:

  • OverallQual (overall quality)
  • YearBuilt (year built)
  • TotalBsmtSF (total basement area)
  • GrLivArea (above-ground living area)

Hmmm… It seems that 'SalePrice' and 'GrLivArea' are really old friends, with a linear relationship.
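As a quick sanity check of that relationship, here is a minimal scatter-plot sketch; it assumes df_train is the training set loaded from train.csv, just like the snippets later in this note:

import pandas as pd

df_train = pd.read_csv('./train.csv')

# scatter of GrLivArea against SalePrice: the cloud hugs a straight line
data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0, 800000));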

In my opinion, this heatmap is the best way to get a quick overview of our ‘plasma soup’ and its relationships. (Thank you @seaborn!)
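In case you want to reproduce it, a minimal seaborn sketch (df_train as above; vmax=0.8 and the figure size are arbitrary styling choices):

import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix over the numeric columns only
corrmat = df_train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True);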

correlation.png

Two correlated blocks stand out in the heatmap:

  • TotalBsmtSF and 1stFlrSF
  • the GarageX variables

saleprice_correlation_matrix.png
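Continuing from the heatmap sketch above, a zoomed 'SalePrice' correlation matrix like this one can be built by taking the k variables most correlated with 'SalePrice' (k=10 is an arbitrary choice) and annotating the cells:

import numpy as np

k = 10  # number of variables in the zoomed heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
            annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()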

According to our crystal ball, these are the variables most correlated with ‘SalePrice’. My thoughts on this:

  • ‘OverallQual’, ‘GrLivArea’ and ‘TotalBsmtSF’ are strongly correlated with ‘SalePrice’. Check!
  • ‘GarageCars’ and ‘GarageArea’ are also some of the most strongly correlated variables. However, as we discussed in the last sub-point, the number of cars that fit into the garage is a consequence of the garage area. ‘GarageCars’ and ‘GarageArea’ are like twin brothers. You’ll never be able to distinguish them. Therefore, we just need one of these variables in our analysis (we can keep ‘GarageCars’ since its correlation with ‘SalePrice’ is higher).
  • ‘TotalBsmtSF’ and ‘1stFloor’ also seem to be twin brothers. We can keep ‘TotalBsmtSF’ just to say that our first guess was right (re-read ‘So… What can we expect?’).
  • ‘FullBath’?? Really?
  • ‘TotRmsAbvGrd’ and ‘GrLivArea’, twin brothers again. Is this dataset from Chernobyl?
  • Ah… ‘YearBuilt’… It seems that ‘YearBuilt’ is slightly correlated with ‘SalePrice’. Honestly, it scares me to think about ‘YearBuilt’ because I start feeling that we should do a little bit of time-series analysis to get this right. I’ll leave this as a homework for you.

Let’s proceed to the scatter plots.

scatterplot.png
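Scatter-plot grids like the one above can be drawn with seaborn's pairplot; a minimal sketch over the shortlist from the correlation analysis (sns, plt and df_train as above):

# pairwise scatter plots between SalePrice and the shortlisted variables
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)
plt.show()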

Missing data

Missing data analysis

Using the script below, we can easily summarize the missing data.

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(20))
Total Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000

So, how should we handle the missing data?

We’ll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed. So we delete ‘PoolQC’, ‘MiscFeature’, ‘Alley’, ‘Fence’, ‘FireplaceQu’ and ‘LotFrontage’.

As for 'GarageX', they all have the same number of missing values, so the missing data probably refers to the same set of observations. Since the most important information regarding garages is expressed by 'GarageCars', and considering that we are only talking about 5% of missing data, I'll delete the mentioned 'GarageX' variables. The same logic applies to the 'BsmtX' variables.

As for 'MasVnrArea' (masonry veneer area) and 'MasVnrType' (masonry veneer type), we can consider that these variables have a strong correlation with 'YearBuilt' and 'OverallQual', which are already considered. So we delete 'MasVnrArea' and 'MasVnrType'.

Delete missing variables

We’ll delete all the variables with missing data, except the variable ‘Electrical’. In ‘Electrical’ we’ll just delete the observation with missing data.

#dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max() #just checking that there's no missing data missing...

If the output is '0', it means all the missing data has been deleted.

Outliers

The primary concern here is to establish a threshold that defines an observation as an outlier. To do so, we'll standardize the data. In this context, data standardization means converting data values to have a mean of 0 and a standard deviation of 1.

#standardizing data
from sklearn.preprocessing import StandardScaler
import numpy as np

saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis]);
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range= saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)

The purpose of this step is to find the outliers in the data; what deserves attention here are the two values greater than 7.

#bivariate analysis saleprice/grlivarea
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

#deleting points
df_train.sort_values(by = 'GrLivArea', ascending = False)[:2]
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)

#bivariate analysis saleprice/totalbsmtsf
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Delete the outliers in 'GrLivArea'.

Then examine the outliers in 'TotalBsmtSF': they stay within an acceptable range, so we keep them.

Getting hard core

According to Hair et al. (2013), four assumptions should be tested:

  • Normality: The data should look like a normal distribution.
  • Homoscedasticity: we want the error terms to be the same across all values of the independent variables (a quick graphical check is sketched after this list).
  • Linearity: as already done above, use scatter plots to check whether two variables are linearly related; if the relationship is not linear, a suitable mathematical transformation can often make it so.
  • Absence of correlated errors: errors should not be correlated with one another, e.g. one error systematically predicting the next.
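A minimal sketch of that graphical homoscedasticity check, with df_train and plt as in the earlier snippets: plot the raw scatter and look for a 'cone' whose spread widens along the x-axis.

# a cone shape widening with GrLivArea would indicate heteroscedasticity;
# after the log transforms below, the band should look roughly even
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()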

Normality

The point here is to test ‘SalePrice’ in a very lean way. We’ll do this paying attention to:

  • Histogram - Kurtosis and skewness.
  • Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.
#histogram and normal probability plot
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm

sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)

probablity_plot.png

Applying a log transformation to the variable gives:

#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])

Theoretical_quantiles.png

As we can see, the points are now distributed much more evenly on both sides of the line.

Apply the same treatment to 'GrLivArea' (sketched below) and 'TotalBsmtSF'.
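'GrLivArea' contains no zero values, so its transformation is direct; a minimal sketch mirroring the 'SalePrice' step above:

#applying log transformation to GrLivArea
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])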

A serious problem here, though, is that some 'TotalBsmtSF' values are zero, and we can't take the log of zero. To apply the log transformation, we'll create a binary variable indicating whether the house has a basement or not. Then we'll apply the log transformation to all non-zero observations and ignore those equal to zero. This way we can transform the data without losing the effect of having or not having a basement.

#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df_train['HasBsmt'] = pd.Series(len(df_train['TotalBsmtSF']), index=df_train.index)
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF']>0,'HasBsmt'] = 1

#transform data
df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(df_train['TotalBsmtSF'])

#histogram and normal probability plot
sns.distplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], plot=plt)

Main Variables

Variable                Segment  Data Type  Comments
GrLivArea               1        0          living area
TotalBsmtSF             1        0          total basement area
GarageArea/GarageCars   1        0          garage
YearBuilt               0        1          year built
CentralAir              0        1          central air conditioning
OverallQual             0        1          overall quality
Neighborhood            2        1          location

Now we can be sure these 7 variables will participate in our model, and the data set has been cleaned. The final thing left to do is to get the PREDICTION.

Model: Random forest

Why use this? I don't know; the original blog didn't clearly explain the reason.

The code is displayed below. I still have a little trouble understanding the random forest algorithm itself.

# load the data
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm, linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data_train = pd.read_csv('./train.csv')
cols = ['OverallQual','GrLivArea', 'GarageCars','TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
x = data_train[cols].values
y = data_train['SalePrice'].values
x_scaled = preprocessing.StandardScaler().fit_transform(x)
# ravel() flattens the scaled target back to 1-D, which the regressors expect
y_scaled = preprocessing.StandardScaler().fit_transform(y.reshape(-1,1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size=0.33, random_state=42)

clfs = {
    'svm': svm.SVR(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=400),
    'BayesianRidge': linear_model.BayesianRidge()
}
for clf in clfs:
    try:
        clfs[clf].fit(X_train, y_train)
        y_pred = clfs[clf].predict(X_test)
        # mean error on the scaled target, used here as a rough "cost"
        print(clf + " cost:" + str(np.sum(y_pred - y_test) / len(y_pred)))
    except Exception as e:
        print(clf + " Error:")
        print(str(e))

# retrain the best model on the raw (unscaled) data
cols = ['OverallQual','GrLivArea', 'GarageCars','TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
x = data_train[cols].values
y = data_train['SalePrice'].values
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

clf = RandomForestRegressor(n_estimators=400)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)

rfr = clf
data_test = pd.read_csv("./test.csv")
data_test[cols].isnull().sum()
data_test['GarageCars'].describe()
data_test['TotalBsmtSF'].describe()

# fill the two columns with missing values using their means from describe() above
cols2 = ['OverallQual','GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
cars = data_test['GarageCars'].fillna(1.766118)
bsmt = data_test['TotalBsmtSF'].fillna(1046.117970)
data_test_x = pd.concat([data_test[cols2], cars, bsmt], axis=1)
# reorder the columns to match the training feature order
data_test_x = data_test_x[cols]
data_test_x.isnull().sum()

x = data_test_x.values
y_te_pred = rfr.predict(x)
print(y_te_pred)

print(y_te_pred.shape)
print(x.shape)

prediction = pd.DataFrame(y_te_pred, columns=['SalePrice'])
result = pd.concat([data_test['Id'], prediction], axis=1)
# result = result.drop(result.columns[0], 1)
result.columns

# save the prediction
result.to_csv('./Predictions.csv', index=False)
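Note that the ad-hoc "cost" printed above is not the competition metric: Kaggle scores this task by RMSE between the logarithms of the predicted and observed prices. A minimal sketch of that evaluation on the held-out split, assuming clf, X_test and y_test from the final model above:

from sklearn.metrics import mean_squared_error

# RMSE on log prices, matching the competition's evaluation
y_pred = clf.predict(X_test)
rmse_log = np.sqrt(mean_squared_error(np.log(y_test), np.log(y_pred)))
print("RMSE(log):", rmse_log)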

[NOTE] Setting Up HTTPS

@hou has been talking about this for a while, and the certificate I applied for last time is about to expire, so this time I'm recording the workflow and caveats of enabling HTTPS for a wildcard domain.

UPDATE 2020/06/15

certbot is too painful to use: it's supposed to renew automatically but never works for me, so I switched to acme.sh.

Read more

[Script] Quick-and-Dirty Scripts I

Quick-and-dirty shell scripts (1)

1. Automatically add users and configure permissions

Goals:

  • Read parameters: the target hosts and the username
  • Add the new user to a group: useradd -G
    • Set a default password for the user: on CentOS you can use passwd --stdin USERNAME, i.e. echo "PASSWORD" | passwd --stdin USERNAME in a script; Ubuntu's passwd doesn't support --stdin, so use echo USERNAME:PASSWORD | chpasswd instead
    • Add the group to /etc/sudoers beforehand
  • Prompt the user to change the password at first login, with an account lifetime of 2 years: chage
#!/bin/bash
# Program:
#   Grant a new user ROOT privileges.
#
# History:
# 2019/11/9 Fusidic First release
echo "Please input username"
read username
path='/root/bin/hosts'
cat "$path" | while read line; do
    echo $line
    ssh -n $line "useradd ${username} -m -G AWG &"
    # single quotes keep the inner command intact inside the double-quoted string
    ssh -n $line "echo '${username}:nuaacs204' | chpasswd &"
    # ssh -n $line "chage -d 0 ${username} &"
done

Problems encountered:

  • The while loop only executed once; fixed by adding the -n flag to ssh (otherwise ssh swallows the rest of the loop's stdin, i.e. the remaining hosts)

  • Changing the new user's password failed with passwd: Authentication token manipulation error passwd: password unchanged; this happens when you have permission to modify /etc/passwd but not /etc/shadow

  • Unresolved: using chage to limit the account's validity and force a password change right after first login doesn't work; the password change has no effect, probably a permission issue

[NOTE] Musings on Blockchain

When it comes to blockchain technology, there is one thing you can't get around: Bitcoin. Bitcoin is called a "distributed ledger", in contrast to banks. Traditional currency has the bank at its center: banks keep the currency stable through financial levers such as printing money, trading bonds, and adjusting reserve requirements. As a trusted third party, the bank holds the transaction records of both sides, that is, the ledger.

There is also a backdrop to Satoshi Nakamoto's Bitcoin white paper: the 2008 financial crisis, which was later blamed on the abuse of credit by banks and rating agencies. It raised a worry: if one day the banks become "unreliable", will the money in our hands still be reliable?

Read more
阅读更多