## Blog 7

As the end of the summer session on campus approaches, we keep making progress on the research, but we are also facing more and more difficulties. Running HBM in R is one of them. Since the Hierarchical Bayesian Model is still a developing method, it is hard and time-consuming to find specific functions that fit our model. There are so many arguments and functions in the package to explore. While working on this and talking more with the professor about it, I have also started learning about the validation part of random forest.

## Blog 6

I still remember feeling half nervous and half excited while waiting for the “predict” function to finish, and I was thrilled when I saw the map with the predicted values. Although we still have to add more variables to improve the accuracy of the data and the population description map, this first step, the template-making process, is crucial. After discussing with Professor Frazier, we plan to explore some mathematical models first and use the sample data from the DHS website to create more continuous and detailed population datasets. One of the most promising methods for this is the Hierarchical Bayesian Model (HBM). We have figured out a general outline of our later work. We already have all the layers and have run the data through random forest to get the clan population prediction. Now we are going to collect more specific household data from DHS, such as household size and household locations, and put the data into the HBM, which may involve some multivariable logistic regression models. These kinds of models take discrete point data as input and generate predicted continuous data as output. This step will be noteworthy for the whole research. Also, we are not going to use individual data from DHS but household data. This is a noteworthy part of our study. Some earlier studies have investigated individual information and plotted it on a map. However, many aspects of life are much more strongly related to household factors. For example, where people choose to live depends heavily on where the working members of the family work. The household, or in other words the family, is highly important for mapping population and population description. Only a few studies take household size and related information into account, so our research will be a breakthrough.
However, there are also some technical issues waiting to be addressed, and most of them will require more advanced mathematical models. These are mostly what we have been discussing and researching during this period. We are studying more and more about the promising HBM method and keep encountering new problems.

## Blog 5

After generating the random forest model, which is our first milestone, we are going to use it as a tool to predict the population distribution. First off, we have to replace the missing values with zero by using “replace_na” to speed up the later process, and then drop some layers that do not help the predicting model. These are all necessary steps to reorganise the data and make the predicting process much faster and more efficient. We also found that “raster” is a very powerful package for aggregating and tidying up the data. Many useful and common functions either come from it or are built on it. Since the data set is pretty big and the model takes some time to process it, I learned many skills from the professor and also talked to some experts in the college IT office. For this particular process, we used “beginCluster” and “endCluster”, which speed up the process by running it in parallel. In other cases, we might use the college's shared computers to create some nodes and files instead of using our own computer. The HPC functions make the process much faster, but it takes more effort to look at the plots and the results of individual lines of code, since the code has to run all together at one time.

## Blog 4

Time flies so fast. Though we are a little behind schedule for now, we already have a clear plan ahead. For now, we are going to create a template and analyse the data we have already downloaded. I have to keep reading papers about random forest to see how the model works underneath. I have read more about the bootstrapping method, the tree-building process, how many trees to create, and so on. Before setting up the random forest model, we are going to check whether all the sub-variables of each variable (land cover, nighttime light, and settlement) are important to the population distribution that we care about. We use RMSE to make an importance plot of the variables and delete the sub-variables whose importance is zero. For land cover, the “snow” variable has been deleted from the data set since it doesn’t have any effect on the result and might have a bad effect on the running time of the model. Then, after sorting through all the data we downloaded, we are going to put the data into the “random forest” model to create a Large randomForest formula. From there, I also create a variable importance plot to see which variable is the most crucial for our project. The result is that “urban” is the most important one, and “water.permanent” is the least important one.
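
To make the importance idea more concrete, here is a minimal Python sketch. It is not the R code from the project, and everything in it is an assumption made for illustration: the tiny data set, the `predict` function standing in for a fitted model, and the use of a fixed rotation instead of a random shuffle so the result is reproducible. The idea is to scramble one variable at a time and see how much the RMSE gets worse. A variable the model never uses, like “snow” here, gets an importance of exactly zero and can be dropped.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical covariates for six areas (made-up numbers, not project data).
rows = [
    {"urban": 0.9, "water": 0.1, "snow": 0.0},
    {"urban": 0.1, "water": 0.6, "snow": 0.2},
    {"urban": 0.5, "water": 0.2, "snow": 0.0},
    {"urban": 0.3, "water": 0.4, "snow": 0.1},
    {"urban": 0.7, "water": 0.0, "snow": 0.0},
    {"urban": 0.2, "water": 0.5, "snow": 0.3},
]
# Hypothetical observed population for the same six areas.
target = [92.0, 15.0, 53.0, 33.0, 71.0, 24.0]

def predict(row):
    """Hypothetical stand-in for a fitted model; note that it ignores "snow"."""
    return 100.0 * row["urban"] + 10.0 * row["water"]

def permutation_importance(rows, target, name):
    """Increase in RMSE after scrambling one variable across the rows."""
    baseline = rmse(target, [predict(r) for r in rows])
    values = [r[name] for r in rows]
    # Assumption: rotate by one position as a deterministic "shuffle".
    scrambled = values[-1:] + values[:-1]
    permuted = [{**r, name: v} for r, v in zip(rows, scrambled)]
    return rmse(target, [predict(r) for r in permuted]) - baseline

importance = {name: permutation_importance(rows, target, name)
              for name in ("urban", "water", "snow")}

# Keep only the variables whose importance is above zero.
kept = [name for name, score in importance.items() if score > 0]
print(importance)
print(kept)
```

In this toy example “urban” comes out as the most important variable and “snow” is removed, which mirrors what we saw in our own importance plot, although the real model and data are of course far bigger.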

## Blog 3

As the summer session progresses, I am going to complete the first stage of the research: data collection. This stage is especially important, not only because it is the start of the project but also because the accuracy of the data will directly affect the accuracy of the result, no matter how hard we try to investigate and improve the methods. For starters, we plan to create a template for the research. Based on the other papers I have read, one of the most important variables is land cover, which includes bushes, precipitation, snow, etc. It is directly related to the distribution of population and the habitability of the region. The land cover data is downloaded from xxx. Before conducting research myself, the “data-collection” process seemed fixed and simple in my mind, but there are many factors to consider when I am actually searching for data of the best quality. Since we plan to make a template first, we are going to use only a few variables for now. Based on the papers I have read, two other important variables besides land cover are nighttime light and settlement. One of the main ideas of this research is to investigate how to generate a synthetic population from the various data layers available, using the household as the unit instead of the individual, as most other people do. The household is a crucial point when thinking about population distribution and even disease spread. In this case, settlement is an important variable. After collecting all these variables, we have to make sure that the collection time, region, and organization of the data sets correspond to each other. Another important step is to organise the data before aggregating all the variables together. We have to get the general population information of Liberia. Based on that data, we have to aggregate individual or household data into clan data based on clan ID.
In other words, we are going to get variable data for each clan. From there, we can combine all the variable data based on clan ID using the “merge” function. After all this work with the variables, the week just passed by.
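
The aggregate-then-merge step can be sketched in a few lines of Python. This is only an illustration and not the R code we ran: the clan IDs, household sizes, covariate values, and the helper names `aggregate_by_clan` and `merge_on_clan_id` are all made up. The sketch first sums household records up to the clan level, then joins the result with clan-level variables on the shared clan ID, keeping only the clans that appear in both tables (which is also what R's “merge” does by default).

```python
from collections import defaultdict

# Hypothetical household records: (clan_id, household_size).
households = [
    ("C01", 4),
    ("C01", 6),
    ("C02", 3),
    ("C02", 5),
    ("C02", 7),
]

# Hypothetical clan-level variables, keyed by the same clan ID.
covariates = {
    "C01": {"urban": 0.40, "nightlight": 2.5},
    "C02": {"urban": 0.10, "nightlight": 0.5},
    "C03": {"urban": 0.05, "nightlight": 0.1},  # no household records for this clan
}

def aggregate_by_clan(records):
    """Sum household counts and sizes for each clan ID."""
    totals = defaultdict(lambda: {"households": 0, "population": 0})
    for clan_id, size in records:
        totals[clan_id]["households"] += 1
        totals[clan_id]["population"] += size
    return dict(totals)

def merge_on_clan_id(left, right):
    """Inner join of two clan-keyed tables: keep clans present in both."""
    shared = sorted(left.keys() & right.keys())
    return {clan_id: {**left[clan_id], **right[clan_id]} for clan_id in shared}

clan_totals = aggregate_by_clan(households)
clan_table = merge_on_clan_id(clan_totals, covariates)

for clan_id, row in clan_table.items():
    print(clan_id, row)
```

Clan “C03” drops out of the merged table because it has no household records, which is a good reminder to check how many clans survive the merge before feeding the table into a model.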