Learn linear regression in R/Python

There are many ways to learn linear regression, and there are plenty of resources available for each approach. One of the easiest I have found is to learn linear regression through the central limit theorem and probability.

I am strongly influenced by the approach taught at my university and by the book I am following, Applied Regression Analysis: A Second Course in Business and Economic Statistics. All my knowledge of the concepts is borrowed from there.

I am mentioning this resource in case you want to become fluent in linear regression. It is a long, heavy read, so I don't expect you to master it. I am only suggesting it here; if you find the concepts tough, or think you can't get through the book, you may be better off following some other resource.

applied regression analysis

If you are fluent in the central limit theorem and statistical tests on two samples, you can read the book directly from the 3rd chapter. The material is a little tough if you are not comfortable with the central limit theorem in particular, so I strongly advise you to watch the videos on the central limit theorem from Khan Academy.

I advise you to stick to one approach instead of learning both. You can learn both, but it is not required, because what matters most is how effectively you use linear regression in Python/R. Learning the concept is just to know how linear regression works; in the end you will solve your problems in a few commands, using linear_model from sklearn in Python and lm() in R. The resource to learn linear regression in Python is here:

linear regression

sklearn is very helpful for machine learning in Python, and its documentation is an easy read for beginners. The linear regression code in Python is very simple.
To learn the code in R, just enter ?lm in the R console, or read the documentation mentioned here:

linear regression in r
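
Before reaching for a library, it may help to see what a simple (one predictor) least-squares fit actually computes. The following is only a minimal sketch in plain Python; the function name fit_line and the toy numbers are my own assumptions for illustration, and lm() in R or linear_model in sklearn does this (and much more) for you.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data made up for illustration; the points lie exactly on y = 1 + 2x
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Once the idea is clear, the library calls are the better choice for real work, since they also handle several predictors and report standard errors.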

I also advise you to solve problems based on linear regression. Take a simple sample data set, run a linear regression on it, and try to plot your results. You can get many data sets online; prefer a simple data set so that you can proceed directly to the regression without much data manipulation.
For the data sets you can try:

uci machine learning datasets
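
As a rough idea of what such a practice run could look like, here is a small Python sketch that reads a two-column CSV, fits a line, and reports R squared as a quick check of the fit. The CSV text, the column names x and y, and the helper names load_xy and fit_and_score are all invented for illustration; a real downloaded file would be opened with open() instead of io.StringIO.

```python
import csv
import io

# Stand-in for a downloaded file; these numbers are invented for illustration.
RAW = """x,y
1,2.1
2,3.9
3,6.2
4,7.8
5,10.1
"""


def load_xy(text, x_col, y_col):
    """Read two numeric columns from CSV text into lists of floats."""
    rows = list(csv.DictReader(io.StringIO(text)))
    xs = [float(r[x_col]) for r in rows]
    ys = [float(r[y_col]) for r in rows]
    return xs, ys


def fit_and_score(xs, ys):
    """Least-squares fit of y = a + b*x; returns (b, a, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    # R squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return b, a, 1 - ss_res / ss_tot


xs, ys = load_xy(RAW, "x", "y")
b, a, r2 = fit_and_score(xs, ys)
print("slope", b, "intercept", a, "R^2", r2)
```

An R squared close to 1 means the line explains most of the variation in y, which is what you would expect to see confirmed when you plot the points and the fitted line together.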

Now you are all done, and the next thing to learn is data manipulation, because your data is generally not suitable for applying linear regression directly. You need to get your data into the correct form first. There are many techniques for that, and hopefully I will explain them in a series of posts. It is the toughest part of data analysis, because each data set needs a different approach and there are millions of data sets out there.
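
To give a flavour of what getting data into the correct form can mean, here is a minimal Python sketch of one common first step: dropping rows whose values are missing or not numeric. This is just one technique chosen for illustration, not a complete recipe; the function name clean_rows and the sample rows are assumptions of mine.

```python
import math


def clean_rows(rows, columns):
    """Keep only rows where every listed column parses as a finite float."""
    cleaned = []
    for row in rows:
        try:
            values = {c: float(row[c]) for c in columns}
        except (KeyError, TypeError, ValueError):
            # missing column, None, empty string or text such as "NA"
            continue
        if all(math.isfinite(v) for v in values.values()):
            cleaned.append(values)
    return cleaned


# Invented sample rows, as they might come out of csv.DictReader
raw_rows = [
    {"x": "1.0", "y": "2.5"},
    {"x": "", "y": "3.1"},     # missing x
    {"x": "3.0", "y": "NA"},   # non-numeric y
    {"x": "4.0", "y": "8.2"},
    {"x": "nan", "y": "1.0"},  # parses as a float but is not a usable number
]
cleaned = clean_rows(raw_rows, ["x", "y"])
print(len(raw_rows), "rows in,", len(cleaned), "rows kept")
```

Whether dropping rows is the right choice depends on the data set; sometimes filling in or transforming values is more appropriate.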


EDA on pm25 data

I did an exploratory data analysis in R on the pm25 dataset about air pollution in the USA:
1) mapped the points in the USA that are the most polluted and the least polluted
2) mapped the fips data to counties with the help of another dataset, county.fips

library(maps)   # provides the county.fips dataset
library(ggmap)  # provides get_map() and ggmap()
data(county.fips)
# obj is assumed to be the pm25 data frame, already loaded, with columns pm25, region, longitude, latitude and fips
#pm2.5 above 15 (west): longitude and latitude
pm25above15=obj[(obj$pm25>15&obj$region=='west'),c('longitude','latitude','fips')]
#pm2.5 below 6 (east): longitude and latitude
pm25less6=obj[(obj$pm25<6&obj$region=='east'),c('longitude','latitude','fips')]
#identifying counties by joining on the fips code
df=merge(obj,county.fips,by='fips')
#counties having pm25 greater than 15
counties_above=df[df$fips %in% pm25above15$fips,'polyname']
#counties having pm25 less than 6
counties_below=df[df$fips %in% pm25less6$fips,]
#drawing the lon,lat locations on map
pts=rbind(pm25above15,pm25less6)
map=get_map(location=c(mean(pts$longitude),mean(pts$latitude)+15),zoom=3,maptype='terrain',scale=1,crop=TRUE)
#colour is set outside aes() so that the points are actually drawn in red
ggmap(map)+geom_point(data=pts,aes(x=longitude,y=latitude),colour='red')

[Figure: Rplot01]