[Golang] Containers

Of course, "containers" here doesn't mean runtime containers in the CNI sense, but the ways Go stores and organizes data.

Read more

[Golang] Functions

Function features in Go:

  • Functions can be passed around as values
  • Anonymous functions and closures are supported
  • Functions can satisfy interfaces
Read more

[Tips] Terminal 10X Workflow

This post is reposted from legendtkl's blog post 终端10X工作法. I've always enjoyed the small tricks that come with everyday terminal use, and that article happens to collect some representative ones. Building on it, I'll record the tricks I use most often and keep them at the top for easy lookup.

Many thanks as well to legendtkl's article 合抱之木,生于毫末, which is one of the reasons I started blogging in the first place.

The original article follows.

Read more

[NOTE] ClockSynchronize

Ubuntu

  • Check your current time-zone: cat /etc/timezone
  • Check if your clock is synchronized with the internet: timedatectl
root@openstack1:~/kubernetes# timedatectl
Local time: Tue 2019-12-17 10:30:02 UTC
Universal time: Tue 2019-12-17 10:30:02 UTC
RTC time: Tue 2019-12-17 10:30:02
Time zone: Etc/UTC (UTC, +0000)
System clock synchronized: yes
systemd-timesyncd.service active: yes
RTC in local TZ: no

If your clock hasn't been synchronized with the internet, the value of System clock synchronized will be no.

Use sudo systemctl restart systemd-timesyncd.service to activate timesyncd:

root@openstack1:~/kubernetes# systemctl status systemd-timesyncd.service
● systemd-timesyncd.service - Network Time Synchronization
Loaded: loaded (/lib/systemd/system/systemd-timesyncd.service; enabled; vendo
Active: active (running) since Mon 2019-12-16 09:24:02 UTC; 1 day 1h ago
Docs: man:systemd-timesyncd.service(8)
Main PID: 27584 (systemd-timesyn)
Status: "Synchronized to time server 91.189.94.4:123 (ntp.ubuntu.com)."
Tasks: 2 (limit: 4915)
CGroup: /system.slice/systemd-timesyncd.service
└─27584 /lib/systemd/systemd-timesyncd
  • The final step to synchronize the clock

sudo timedatectl set-ntp true

  • Switch your time-zone

Run timedatectl list-timezones to list all available time zones.

sudo timedatectl set-timezone Asia/Shanghai

CentOS

  • Using command line

    # installation
    yum install chrony
    # enable chronyd
    systemctl start chronyd
    systemctl enable chronyd
    # set timezone to Shanghai
    timedatectl set-timezone Asia/Shanghai
    # turn on NTP synchronization
    timedatectl set-ntp yes

[NOTE] Kaggle house price prediction

Abstract

This is a copycat of Comprehensive data exploration with Python. Since I had limited time for this AI project, I preferred to learn from someone else's notebook, and 'Comprehensive data exploration with Python' was apparently the best fit for me.

According to the article, the first thing we should do is look through the whole data set and find the most important variables, the ones that matter when you buy a house.

Then an important problem we must deal with is data cleaning.

Overview

While 'Type' and 'Segment' are just for possible future reference, the column 'Expectation' is important because it will help us develop a 'sixth sense'. To fill this column, we should read the description of all the variables and, one by one, ask ourselves:

  • Do we think about this variable when we are buying a house? (e.g. When we think about the house of our dreams, do we care about its ‘Masonry veneer type’?).
  • If so, how important would this variable be? (e.g. What is the impact of having ‘Excellent’ material on the exterior instead of ‘Poor’? And of having ‘Excellent’ instead of ‘Good’?).
  • Is this information already described in any other variable? (e.g. If ‘LandContour’ gives the flatness of the property, do we really need to know the ‘LandSlope’?).

I went through this process and concluded that the following variables can play an important role in this problem:

  • OverallQual (overall quality)
  • YearBuilt (year built)
  • TotalBsmtSF (total basement area)
  • GrLivArea (above-ground living area)

Hmmm… It seems that 'SalePrice' and 'GrLivArea' are really old friends, with a linear relationship.
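As a quick sanity check of that relationship, here is a minimal scatter-plot sketch; it assumes df_train is the training set loaded from train.csv, just like the snippets later in this note:

import pandas as pd

df_train = pd.read_csv('./train.csv')

# scatter of GrLivArea against SalePrice: the cloud hugs a straight line
data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0, 800000));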

In my opinion, this heatmap is the best way to get a quick overview of our ‘plasma soup’ and its relationships. (Thank you @seaborn!)
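In case you want to reproduce it, a minimal seaborn sketch (df_train as above; vmax=0.8 and the figure size are arbitrary styling choices):

import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix over the numeric columns only
corrmat = df_train.corr(numeric_only=True)
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True);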

correlation.png

Two correlated blocks stand out in the heatmap:

  • TotalBsmtSF and 1stFlrSF
  • the GarageX variables

saleprice_correlation_matrix.png
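Continuing from the heatmap sketch above, a zoomed 'SalePrice' correlation matrix like this one can be built by taking the k variables most correlated with 'SalePrice' (k=10 is an arbitrary choice) and annotating the cells:

import numpy as np

k = 10  # number of variables in the zoomed heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
            annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()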

According to our crystal ball, these are the variables most correlated with ‘SalePrice’. My thoughts on this:

  • ‘OverallQual’, ‘GrLivArea’ and ‘TotalBsmtSF’ are strongly correlated with ‘SalePrice’. Check!
  • ‘GarageCars’ and ‘GarageArea’ are also some of the most strongly correlated variables. However, as we discussed in the last sub-point, the number of cars that fit into the garage is a consequence of the garage area. ‘GarageCars’ and ‘GarageArea’ are like twin brothers. You’ll never be able to distinguish them. Therefore, we just need one of these variables in our analysis (we can keep ‘GarageCars’ since its correlation with ‘SalePrice’ is higher).
  • ‘TotalBsmtSF’ and ‘1stFloor’ also seem to be twin brothers. We can keep ‘TotalBsmtSF’ just to say that our first guess was right (re-read ‘So… What can we expect?’).
  • ‘FullBath’?? Really?
  • ‘TotRmsAbvGrd’ and ‘GrLivArea’, twin brothers again. Is this dataset from Chernobyl?
  • Ah… ‘YearBuilt’… It seems that ‘YearBuilt’ is slightly correlated with ‘SalePrice’. Honestly, it scares me to think about ‘YearBuilt’ because I start feeling that we should do a little bit of time-series analysis to get this right. I’ll leave this as a homework for you.

Let’s proceed to the scatter plots.

scatterplot.png
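Scatter-plot grids like the one above can be drawn with seaborn's pairplot; a minimal sketch over the shortlist from the correlation analysis (sns, plt and df_train as above):

# pairwise scatter plots between SalePrice and the shortlisted variables
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)
plt.show()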

Missing data

Missing data analysis

Using the script below, we can easily summarize the missing data.

#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(20))
Total Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000

So, how should we handle the missing data?

We’ll consider that when more than 15% of the data is missing, we should delete the corresponding variable and pretend it never existed. So we delete ‘PoolQC’, ‘MiscFeature’, ‘Alley’, ‘Fence’, ‘FireplaceQu’ and ‘LotFrontage’.

As for 'GarageX', they all have the same number of missing values, so the missing data probably refers to the same set of observations. Since the most important information regarding garages is expressed by 'GarageCars', and considering that we are only talking about 5% of missing data, I'll delete the mentioned 'GarageX' variables. The same logic applies to the 'BsmtX' variables.

As for 'MasVnrArea' (masonry veneer area) and 'MasVnrType' (masonry veneer type), we can consider that these variables have a strong correlation with 'YearBuilt' and 'OverallQual', which are already considered. So we delete 'MasVnrArea' and 'MasVnrType'.

Delete missing variables

We’ll delete all the variables with missing data, except the variable ‘Electrical’. In ‘Electrical’ we’ll just delete the observation with missing data.

#dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max() #just checking that there's no missing data missing...

If the output is '0', it means all the missing data has been deleted.

Outliers

The primary concern here is to establish a threshold that defines an observation as an outlier. To do so, we'll standardize the data. In this context, data standardization means converting data values to have a mean of 0 and a standard deviation of 1.

#standardizing data
from sklearn.preprocessing import StandardScaler
import numpy as np

saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'].values[:, np.newaxis]);
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range= saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)

The purpose of this step is to find the outliers in the data; what deserves attention here are the two values greater than 7.

#bivariate analysis saleprice/grlivarea
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

#deleting points
df_train.sort_values(by = 'GrLivArea', ascending = False)[:2]
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)

#bivariate analysis saleprice/totalbsmtsf
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Delete the outliers in 'GrLivArea'.

Then examine the outliers in 'TotalBsmtSF': they stay within an acceptable range, so we keep them.

Getting hard core

According to Hair et al. (2013), four assumptions should be tested:

  • Normality: The data should look like a normal distribution.
  • Homoscedasticity: we want the error terms to be the same across all values of the independent variables (a quick graphical check is sketched after this list).
  • Linearity: as already done above, use scatter plots to check whether two variables are linearly related; if the relationship is not linear, a suitable mathematical transformation can often make it so.
  • Absence of correlated errors: errors should not be correlated with one another, e.g. one error systematically predicting the next.
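A minimal sketch of that graphical homoscedasticity check, with df_train and plt as in the earlier snippets: plot the raw scatter and look for a 'cone' whose spread widens along the x-axis.

# a cone shape widening with GrLivArea would indicate heteroscedasticity;
# after the log transforms below, the band should look roughly even
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()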

Normality

The point here is to test ‘SalePrice’ in a very lean way. We’ll do this paying attention to:

  • Histogram - Kurtosis and skewness.
  • Normal probability plot - Data distribution should closely follow the diagonal that represents the normal distribution.
#histogram and normal probability plot
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm

sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)

probablity_plot.png

Applying a log transformation to the variable gives:

#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])

Theoretical_quantiles.png

As we can see, the points are now distributed much more evenly on both sides of the line.

Apply the same treatment to 'GrLivArea' (sketched below) and 'TotalBsmtSF'.
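'GrLivArea' contains no zero values, so its transformation is direct; a minimal sketch mirroring the 'SalePrice' step above:

#applying log transformation to GrLivArea
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])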

A serious problem here, though, is that some 'TotalBsmtSF' values are zero, and we can't take the log of zero. To apply the log transformation, we'll create a binary variable indicating whether the house has a basement or not. Then we'll apply the log transformation to all non-zero observations and ignore those equal to zero. This way we can transform the data without losing the effect of having or not having a basement.

#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df_train['HasBsmt'] = pd.Series(len(df_train['TotalBsmtSF']), index=df_train.index)
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF']>0,'HasBsmt'] = 1

#transform data
df_train.loc[df_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(df_train['TotalBsmtSF'])

#histogram and normal probability plot
sns.distplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], plot=plt)

Main Variables

Variable                Segment  Data Type  Comments
GrLivArea               1        0          living area
TotalBsmtSF             1        0          total basement area
GarageArea/GarageCars   1        0          garage
YearBuilt               0        1          year built
CentralAir              0        1          central air conditioning
OverallQual             0        1          overall quality
Neighborhood            2        1          location

Now we can be sure these 7 variables will participate in our model, and the data set has been cleaned. The final thing left to do is to get the PREDICTION.

Model: Random forest

Why use this? I don't know; the original blog didn't clearly explain the reason.

The code is displayed below. I still have a little trouble understanding the random forest algorithm itself.

# load the data
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm, linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

data_train = pd.read_csv('./train.csv')
cols = ['OverallQual','GrLivArea', 'GarageCars','TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
x = data_train[cols].values
y = data_train['SalePrice'].values
x_scaled = preprocessing.StandardScaler().fit_transform(x)
# ravel() flattens the scaled target back to 1-D, which the regressors expect
y_scaled = preprocessing.StandardScaler().fit_transform(y.reshape(-1,1)).ravel()
X_train, X_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size=0.33, random_state=42)

clfs = {
    'svm': svm.SVR(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=400),
    'BayesianRidge': linear_model.BayesianRidge()
}
for clf in clfs:
    try:
        clfs[clf].fit(X_train, y_train)
        y_pred = clfs[clf].predict(X_test)
        # mean error on the scaled target, used here as a rough "cost"
        print(clf + " cost:" + str(np.sum(y_pred - y_test) / len(y_pred)))
    except Exception as e:
        print(clf + " Error:")
        print(str(e))

# retrain the best model on the raw (unscaled) data
cols = ['OverallQual','GrLivArea', 'GarageCars','TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
x = data_train[cols].values
y = data_train['SalePrice'].values
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

clf = RandomForestRegressor(n_estimators=400)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred)

rfr = clf
data_test = pd.read_csv("./test.csv")
data_test[cols].isnull().sum()
data_test['GarageCars'].describe()
data_test['TotalBsmtSF'].describe()

# fill the two columns with missing values using their means from describe() above
cols2 = ['OverallQual','GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']
cars = data_test['GarageCars'].fillna(1.766118)
bsmt = data_test['TotalBsmtSF'].fillna(1046.117970)
data_test_x = pd.concat([data_test[cols2], cars, bsmt], axis=1)
# reorder the columns to match the training feature order
data_test_x = data_test_x[cols]
data_test_x.isnull().sum()

x = data_test_x.values
y_te_pred = rfr.predict(x)
print(y_te_pred)

print(y_te_pred.shape)
print(x.shape)

prediction = pd.DataFrame(y_te_pred, columns=['SalePrice'])
result = pd.concat([data_test['Id'], prediction], axis=1)
# result = result.drop(result.columns[0], 1)
result.columns

# save the prediction
result.to_csv('./Predictions.csv', index=False)
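Note that the ad-hoc "cost" printed above is not the competition metric: Kaggle scores this task by RMSE between the logarithms of the predicted and observed prices. A minimal sketch of that evaluation on the held-out split, assuming clf, X_test and y_test from the final model above:

from sklearn.metrics import mean_squared_error

# RMSE on log prices, matching the competition's evaluation
y_pred = clf.predict(X_test)
rmse_log = np.sqrt(mean_squared_error(np.log(y_test), np.log(y_pred)))
print("RMSE(log):", rmse_log)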

[NOTE] Setting Up HTTPS

@hou has been talking about this for a while, and the certificate I applied for last time is about to expire, so this time I'm recording the workflow and caveats of enabling HTTPS for a wildcard domain.

UPDATE 2020/06/15

certbot is too painful to use: it's supposed to renew automatically but never works for me, so I switched to acme.sh.

Read more

[Script] Quick-and-Dirty Scripts I

Quick-and-dirty shell scripts (1)

1. Automatically add users and configure permissions

Goals:

  • Read parameters: the target hosts and the username
  • Add the new user to a group: useradd -G
    • Set a default password for the user: on CentOS you can use passwd --stdin USERNAME, i.e. echo "PASSWORD" | passwd --stdin USERNAME in a script; Ubuntu's passwd doesn't support --stdin, so use echo USERNAME:PASSWORD | chpasswd instead
    • Add the group to /etc/sudoers beforehand
  • Prompt the user to change the password at first login, with an account lifetime of 2 years: chage
#!/bin/bash
# Program:
#   Grant a new user ROOT privileges.
#
# History:
# 2019/11/9 Fusidic First release
echo "Please input username"
read username
path='/root/bin/hosts'
cat "$path" | while read line; do
    echo $line
    ssh -n $line "useradd ${username} -m -G AWG &"
    # single quotes keep the inner command intact inside the double-quoted string
    ssh -n $line "echo '${username}:nuaacs204' | chpasswd &"
    # ssh -n $line "chage -d 0 ${username} &"
done

Problems encountered:

  • The while loop only executed once; fixed by adding the -n flag to ssh (otherwise ssh swallows the rest of the loop's stdin, i.e. the remaining hosts)

  • Changing the new user's password failed with passwd: Authentication token manipulation error passwd: password unchanged; this happens when you have permission to modify /etc/passwd but not /etc/shadow

  • Unresolved: using chage to limit the account's validity and force a password change right after first login doesn't work; the password change has no effect, probably a permission issue

[NOTE] Musings on Blockchain

When it comes to blockchain technology, there is one thing you can't get around: Bitcoin. Bitcoin is called a "distributed ledger", in contrast to banks. Traditional currency has the bank at its center: banks keep the currency stable through financial levers such as printing money, trading bonds, and adjusting reserve requirements. As a trusted third party, the bank holds the transaction records of both sides, that is, the ledger.

There is also a backdrop to Satoshi Nakamoto's Bitcoin white paper: the 2008 financial crisis, which was later blamed on the abuse of credit by banks and rating agencies. It raised a worry: if one day the banks become "unreliable", will the money in our hands still be reliable?

Read more
阅读更多